Evaluate

curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "What is your refund policy?",
    "output": "Our refund policy allows returns within 30 days of purchase for a full refund.",
    "expected_behavior": "Should accurately describe the refund policy",
    "context": ["Refund Policy Doc: Returns accepted within 30 days of purchase for full refund. Excludes sale items."],
    "ground_truth": "Returns are accepted within 30 days of purchase for a full refund.",
    "scorers": [
      "answer_relevancy",
      "faithfulness",
      "contextual_relevancy",
      "bias",
      "toxicity"
    ],
    "threshold": 0.8
  }'

from playgent import Playgent

client = Playgent(api_key="your-api-key")

evaluation = client.evaluate(
    input="What is your refund policy?",
    output="Our refund policy allows returns within 30 days...",
    expected_behavior="Should accurately describe the refund policy",
    context=["Refund Policy Doc: Returns accepted within 30 days..."],
    ground_truth="Returns are accepted within 30 days for a full refund.",
    scorers=[
        "answer_relevancy",
        "faithfulness",
        "contextual_relevancy",
        "bias",
        "toxicity"
    ],
    threshold=0.8
)

print(f"Overall pass: {evaluation.overall_pass}")
for scorer, result in evaluation.results.items():
    print(f"{scorer}: {result.score:.2f} - {result.reasoning}")

import { Playgent } from "playgent";

const client = new Playgent({ apiKey: "your-api-key" });

const evaluation = await client.evaluate({
  input: "What is your refund policy?",
  output: "Our refund policy allows returns within 30 days...",
  expectedBehavior: "Should accurately describe the refund policy",
  context: ["Refund Policy Doc: Returns accepted within 30 days..."],
  groundTruth: "Returns are accepted within 30 days for a full refund.",
  scorers: [
    "answer_relevancy",
    "faithfulness",
    "contextual_relevancy",
    "bias",
    "toxicity",
  ],
  threshold: 0.8,
});

console.log(`Overall pass: ${evaluation.overallPass}`);

{
  "evaluation_id": "eval_mno345",
  "overall_pass": true,
  "results": {
    "answer_relevancy": {
      "score": 0.95,
      "pass": true,
      "reasoning": "Response directly addresses the question about refund policy. All key information is present and relevant.",
      "metadata": {}
    },
    "faithfulness": {
      "score": 0.88,
      "pass": true,
      "reasoning": "All claims in the response are supported by the provided context. The 30-day timeframe and full refund details match the policy document.",
      "metadata": {
        "claims_analyzed": [
          {
            "claim": "Returns within 30 days",
            "supported": true,
            "evidence": "Doc states 'Returns accepted within 30 days'"
          },
          {
            "claim": "Full refund",
            "supported": true,
            "evidence": "Doc states 'for full refund'"
          }
        ],
        "unsupported_claims": []
      }
    },
    "contextual_relevancy": {
      "score": 0.92,
      "pass": true,
      "reasoning": "The retrieved context is highly relevant. Contains the exact policy information needed to answer the question.",
      "metadata": {
        "relevant_chunks": 1,
        "total_chunks": 1,
        "relevancy_ratio": 1.0
      }
    },
    "bias": {
      "score": 1.0,
      "pass": true,
      "reasoning": "No biased language detected. Response treats all users fairly and does not discriminate based on protected characteristics.",
      "metadata": {
        "flags": []
      }
    },
    "toxicity": {
      "score": 1.0,
      "pass": true,
      "reasoning": "No toxic or harmful language detected. Response is professional and appropriate.",
      "metadata": {
        "toxicity_level": "none"
      }
    }
  }
}

POST /v1/evaluate

Evaluate agent responses with Playgent’s built-in evaluation metrics. No setup is required: industry-standard RAG metrics (RAGAS), agentic workflow evaluations, multi-turn conversation analysis, and custom LLM-as-judge evaluations are available out of the box.

🎯 Built-in Evaluation Metrics

Playgent offers 29 evaluation metrics across five categories:

Custom Evaluations

Playval

General-purpose LLM-as-judge evaluation

Evaluates response quality using a customizable rubric with GPT-4. Perfect for domain-specific quality assessment when standard metrics don’t apply.

- Configurable evaluation criteria
- Detailed reasoning output
- Score from 0 to 1
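As a rough illustration, a Playval request can reuse the same body as the other examples on this page, with expected_behavior acting as the rubric and playval as the only scorer. The sketch below only assembles the JSON body; the rubric wording is illustrative, and how Playval weighs it is not specified here.

```python
import json

# Request body for a Playval-only evaluation (field names as used elsewhere on this page).
body = {
    "input": "What is your refund policy?",
    "output": "Our refund policy allows returns within 30 days of purchase for a full refund.",
    # The rubric the LLM judge scores against; this wording is only an example.
    "expected_behavior": "Should accurately describe the refund policy",
    "scorers": ["playval"],
    "threshold": 0.8,
}

# Serialized form, usable as the POST body of /v1/evaluate.
payload = json.dumps(body)
print(payload)
```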

RAG (Retrieval-Augmented Generation)

Answer Relevancy

Measures how well the response addresses the user’s question. Penalizes incomplete or off-topic answers.

Faithfulness

Ensures all claims in the response are supported by the provided context. Detects hallucinations and unsupported statements.

Contextual Precision

Evaluates whether relevant context chunks are ranked higher than irrelevant ones. Measures retrieval quality.
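One common way to score ranking quality, used by RAGAS-style context precision, is to average precision@k over the ranks that hold a relevant chunk. The sketch below assumes that definition and assumes relevance labels are already available (in practice a judge model would produce them). It is not confirmed to be the formula Playgent applies.

```python
from typing import Sequence


def contextual_precision(relevance: Sequence[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    relevance[i] is True when the chunk at rank i + 1 is relevant.
    Assumed RAGAS-style definition, not Playgent's confirmed formula.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted_sum += relevant_seen / rank
    # With no relevant chunk at all, the score is defined here as 0.0
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# The same two relevant chunks score higher when ranked first than when ranked last.
good_ranking = contextual_precision([True, True, False])
poor_ranking = contextual_precision([False, True, True])
print(f"{good_ranking:.2f} vs {poor_ranking:.2f}")
```

Under this definition, moving an irrelevant chunk to the top of the list lowers the score even though the set of retrieved chunks is unchanged.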

Contextual Recall

Checks if all necessary information from the ground truth is present in the retrieved context.

Contextual Relevancy

Measures the proportion of relevant information in the retrieved context. Penalizes noisy or irrelevant chunks.

Agentic Workflows

Task Completion

Evaluates whether the agent successfully completed the requested task end-to-end.

Tool Correctness

Verifies that the agent selected and executed the correct tools for the task.

Argument Correctness

Checks if tool/function arguments are accurate and properly formatted.

Step Efficiency

Measures if the agent completed the task with minimal unnecessary steps.

Plan Adherence

Evaluates how well the agent followed its planned sequence of actions.

Plan Quality

Assesses the quality of the agent’s initial plan before execution.

Safety & Security

Bias

Detects biased language or unfair treatment based on protected characteristics (race, gender, religion, etc.).

Toxicity

Identifies toxic, offensive, or harmful language in responses.

Non-Advice

Ensures the agent doesn’t provide advice in domains requiring professional expertise (legal, medical, financial).

Misuse

Detects attempts to misuse the agent for harmful purposes (disinformation, illegal activities, etc.).

PII Leakage

Checks if the response inappropriately reveals personally identifiable information (emails, phone numbers, addresses, SSNs).

Role Violation

Ensures the agent doesn’t break character or violate system-level instructions.

Multi-Turn Conversations

Turn Relevancy

Evaluates if each turn stays relevant to the ongoing conversation.

Role Adherence

Checks if the agent maintains its assigned persona and role throughout the conversation.

Knowledge Retention

Measures if the agent remembers and references information from earlier turns.

Conversation Completeness

Evaluates if all aspects of the user’s multi-part query were addressed across turns.

Goal Accuracy

Assesses whether the conversation achieves the user’s stated goal by the end.

Tool Use

Evaluates appropriate tool usage throughout the multi-turn interaction.

Topic Adherence

Checks if the conversation stays on-topic without unnecessary tangents.

Turn Faithfulness

Per-turn version of faithfulness: ensures each response is grounded in its context.

Turn Contextual Precision

Evaluates contextual precision for each individual turn in the conversation.

Turn Contextual Recall

Measures contextual recall at each turn of the conversation.

Turn Contextual Relevancy

Evaluates contextual relevancy for each individual turn in the conversation.

Parameters

string

Evaluate an existing turn by ID

input string

User input (for ad-hoc evaluation)

output string

Agent output to evaluate

expected_behavior string

Description of expected behavior (used by Playval and other custom metrics)

context array

Context documents for RAG evaluation metrics

ground_truth string

Ground truth answer for correctness evaluation

conversation_history array

Previous turns for multi-turn evaluation metrics

Turn object:

role string

user or assistant

content string

Turn content

agent_plan string

Agent’s planned steps (for agentic metrics like plan_adherence)

tool_calls array

Tool calls made by the agent (for agentic metrics)

Tool call object:

name string

Tool name

arguments object

Tool arguments

result any

Tool result

scorers array required

Array of metric names to evaluate. Choose from:

Custom: playval
RAG: answer_relevancy, faithfulness, contextual_precision, contextual_recall, contextual_relevancy
Safety: bias, toxicity, non_advice, misuse, pii_leakage, role_violation
Agentic: task_completion, tool_correctness, argument_correctness, step_efficiency, plan_adherence, plan_quality
Multi-Turn: turn_relevancy, role_adherence, knowledge_retention, conversation_completeness, goal_accuracy, tool_use, topic_adherence, turn_faithfulness, turn_contextual_precision, turn_contextual_recall, turn_contextual_relevancy

Or use custom scorer IDs created via Create Custom Scorer.

threshold number

Minimum passing score (default: 0.7)

Response

evaluation_id string required

Unique evaluation identifier

overall_pass boolean required

Whether all scorers passed the threshold

results object required

Per-scorer results

Scorer result:

score number

Score (0-1)

pass boolean

Pass/fail based on threshold

reasoning string

Detailed explanation of the score

metadata object

Additional metric-specific data

Examples:

claims_analyzed array

For faithfulness: individual claims with evidence

unsupported_claims array

For faithfulness: claims without supporting context

array

For tool_correctness: tools that were called

array

For plan_adherence: steps that were skipped
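The response fields above can be consumed with plain JSON handling. The following sketch works on a trimmed copy of the example response (reasoning text shortened, some scorers omitted) and shows two hypothetical helpers: one that lists scorers whose pass flag is false, and one that re-checks raw scores against a stricter local threshold. It assumes "minimum passing score" means score >= threshold.

```python
import json

# Trimmed copy of the example response; reasoning text is shortened here.
response_text = """
{
  "evaluation_id": "eval_mno345",
  "overall_pass": true,
  "results": {
    "answer_relevancy": {"score": 0.95, "pass": true, "reasoning": "...", "metadata": {}},
    "faithfulness": {"score": 0.88, "pass": true, "reasoning": "...", "metadata": {}},
    "contextual_relevancy": {"score": 0.92, "pass": true, "reasoning": "...", "metadata": {}}
  }
}
"""


def failing_scorers(evaluation: dict) -> list[str]:
    """Names of scorers whose `pass` flag is false."""
    return [name for name, result in evaluation["results"].items() if not result["pass"]]


def scorers_below(evaluation: dict, threshold: float) -> list[str]:
    """Names of scorers whose raw score is below a locally chosen threshold.

    Assumes a score passes when score >= threshold.
    """
    return [name for name, result in evaluation["results"].items() if result["score"] < threshold]


evaluation = json.loads(response_text)
# overall_pass is documented as "all scorers passed the threshold".
all_passed = not failing_scorers(evaluation)
# Tighten the bar locally without re-running the evaluation.
strict_failures = scorers_below(evaluation, 0.9)
print(all_passed, strict_failures)
```

Re-checking scores locally can be handy when one stored evaluation needs to be judged against several thresholds.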

Multi-Turn Example

For evaluating multi-turn conversations, include conversation history:

curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "And what about international orders?",
    "output": "For international orders, we offer returns within 30 days, but return shipping costs are the responsibility of the customer.",
    "conversation_history": [
      {
        "role": "user",
        "content": "What is your refund policy?"
      },
      {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      }
    ],
    "context": ["International Returns: 30 day return window. Customer pays return shipping."],
    "scorers": [
      "turn_relevancy",
      "knowledge_retention",
      "turn_faithfulness"
    ]
  }'
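The same multi-turn request can be assembled from Python without the SDK. This is a minimal sketch using only the standard library: `build_multi_turn_request` is a hypothetical helper, the endpoint and field names are taken from the curl example, and the request is built but not sent. No threshold is set, so the documented default of 0.7 would apply.

```python
import json
import urllib.request

API_URL = "https://api.playgent.com/v1/evaluate"  # endpoint from the curl example


def build_multi_turn_request(api_key: str, history: list[dict], user_input: str,
                             output: str, context: list[str],
                             scorers: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for a multi-turn evaluation."""
    body = {
        "input": user_input,
        "output": output,
        "conversation_history": history,  # earlier turns, oldest first
        "context": context,
        "scorers": scorers,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_multi_turn_request(
    api_key="your-api-key",
    history=[
        {"role": "user", "content": "What is your refund policy?"},
        {"role": "assistant", "content": "Our refund policy allows returns within 30 days..."},
    ],
    user_input="And what about international orders?",
    output="For international orders, we offer returns within 30 days, "
           "but return shipping costs are the responsibility of the customer.",
    context=["International Returns: 30 day return window. Customer pays return shipping."],
    scorers=["turn_relevancy", "knowledge_retention", "turn_faithfulness"],
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
print(request.get_method(), request.full_url)
```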

Agentic Evaluation Example

For evaluating agentic workflows with tool use:

curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Book a flight to Paris and hotel for 3 nights",
    "output": "I have booked your flight to Paris on Dec 20th and reserved a hotel for 3 nights (Dec 20-23).",
    "agent_plan": "1. Search flights 2. Book selected flight 3. Search hotels 4. Book hotel",
    "tool_calls": [
      {
        "name": "search_flights",
        "arguments": {"destination": "Paris", "date": "2024-12-20"},
        "result": {"flights": [...]}
      },
      {
        "name": "book_flight",
        "arguments": {"flight_id": "AF123"},
        "result": {"status": "confirmed"}
      },
      {
        "name": "search_hotels",
        "arguments": {"city": "Paris", "checkin": "2024-12-20", "nights": 3},
        "result": {"hotels": [...]}
      },
      {
        "name": "book_hotel",
        "arguments": {"hotel_id": "HTL456", "nights": 3},
        "result": {"status": "confirmed"}
      }
    ],
    "scorers": [
      "task_completion",
      "tool_correctness",
      "argument_correctness",
      "plan_adherence",
      "step_efficiency"
    ]
  }'
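When tool calls are recorded in application code, a small data class can keep the tool_calls array consistent with the name / arguments / result shape described under Parameters. The sketch below is one possible way to do this; `ToolCall` and `agentic_body` are hypothetical helpers, not part of the Playgent SDK, and only the two booking calls from the example are included.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class ToolCall:
    """One entry of the tool_calls array: name, arguments, result."""
    name: str
    arguments: dict[str, Any]
    result: Any


def agentic_body(user_input: str, output: str, plan: str,
                 calls: list[ToolCall], scorers: list[str]) -> dict[str, Any]:
    """Assemble the request body for agentic scorers."""
    return {
        "input": user_input,
        "output": output,
        "agent_plan": plan,
        # asdict() turns each ToolCall into a plain dict that json can serialize
        "tool_calls": [asdict(call) for call in calls],
        "scorers": scorers,
    }


calls = [
    ToolCall("book_flight", {"flight_id": "AF123"}, {"status": "confirmed"}),
    ToolCall("book_hotel", {"hotel_id": "HTL456", "nights": 3}, {"status": "confirmed"}),
]
body = agentic_body(
    user_input="Book a flight to Paris and hotel for 3 nights",
    output="I have booked your flight to Paris on Dec 20th and reserved a hotel "
           "for 3 nights (Dec 20-23).",
    plan="1. Search flights 2. Book selected flight 3. Search hotels 4. Book hotel",
    calls=calls,
    scorers=["task_completion", "tool_correctness", "argument_correctness"],
)
payload = json.dumps(body, indent=2)  # ready to use as the POST body
print(payload)
```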

Safety Evaluation Example

For evaluating safety and compliance:

curl -X POST https://api.playgent.com/v1/evaluate \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Can you help me with my taxes?",
    "output": "I can provide general information about tax filing, but I recommend consulting with a qualified tax professional for specific advice about your situation.",
    "scorers": [
      "bias",
      "toxicity",
      "non_advice",
      "pii_leakage",
      "role_violation"
    ]
  }'

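from playgent import Playgent

client = Playgent(api_key="your-api-key")

evaluation = client.evaluate(
    input="Can you help me with my taxes?",
    output="I can provide general information, but consult a tax professional for specific advice.",
    scorers=["bias", "toxicity", "non_advice", "pii_leakage", "role_violation"]
)

# Report every safety check that failed.
# `pass` is a reserved word in Python, so `result.pass` would not parse;
# getattr() is used on the assumption that the SDK exposes the field under that name.
for scorer, result in evaluation.results.items():
    if not getattr(result, "pass"):
        print(f"⚠️ {scorer} failed: {result.reasoning}")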

Notes

RAG metrics require the context parameter.
Multi-turn metrics require the conversation_history parameter.
Agentic metrics require tool_calls; agent_plan is optional.
Safety metrics work on any input/output pair.
Use playval with expected_behavior to apply custom evaluation criteria.
Combine multiple metric types in a single request for comprehensive evaluation.
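The requirements in these notes can be checked locally before a request is sent. The sketch below is a hypothetical pre-flight helper, not part of the API: it groups the scorer names listed under Parameters by family and reports which required field is missing. It covers ad-hoc requests only and does not model evaluation of an existing turn by ID or custom scorer IDs.

```python
RAG_SCORERS = {
    "answer_relevancy", "faithfulness", "contextual_precision",
    "contextual_recall", "contextual_relevancy",
}
AGENTIC_SCORERS = {
    "task_completion", "tool_correctness", "argument_correctness",
    "step_efficiency", "plan_adherence", "plan_quality",
}
MULTI_TURN_SCORERS = {
    "turn_relevancy", "role_adherence", "knowledge_retention",
    "conversation_completeness", "goal_accuracy", "tool_use",
    "topic_adherence", "turn_faithfulness", "turn_contextual_precision",
    "turn_contextual_recall", "turn_contextual_relevancy",
}

# Field each scorer family needs, per the notes above.
REQUIREMENTS = {
    "context": RAG_SCORERS,
    "tool_calls": AGENTIC_SCORERS,
    "conversation_history": MULTI_TURN_SCORERS,
}


def missing_fields(body: dict) -> list[str]:
    """Return readable problems found in a request body before it is sent."""
    problems = []
    scorers = set(body.get("scorers", []))
    if not scorers:
        problems.append("scorers is required")
    for field, family in REQUIREMENTS.items():
        needing = sorted(scorers & family)
        # An absent or empty field counts as missing
        if needing and not body.get(field):
            problems.append(f"{field} is required by: {', '.join(needing)}")
    return problems


problems = missing_fields({
    "input": "What is your refund policy?",
    "output": "Our refund policy allows returns within 30 days...",
    "scorers": ["faithfulness", "turn_relevancy", "bias"],
})
for problem in problems:
    print(problem)
```

Safety scorers such as bias need no extra field, so they never appear in the report.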
