Evaluations

An evaluation scores agent outputs against one or more metrics. Playgent provides 29 built-in evaluation metrics across five categories: Custom, RAG, Agentic, Multi-Turn, and Safety.

Available Metrics (29 Total)

| Type | Metric | Description |
| --- | --- | --- |
| Custom | playval | General-purpose LLM-as-judge with custom criteria |
| RAG | answer_relevancy | How relevant is the answer to the question? |
| RAG | faithfulness | Is the answer grounded in the provided context? |
| RAG | contextual_precision | Are relevant context chunks ranked higher? |
| RAG | contextual_recall | Does the context contain all needed information? |
| RAG | contextual_relevancy | Is the retrieved context relevant to the query? |
| Agentic | task_completion | Did the agent complete the requested task? |
| Agentic | tool_correctness | Were the right tools selected? |
| Agentic | argument_correctness | Were tool arguments correct? |
| Agentic | step_efficiency | Were unnecessary steps avoided? |
| Agentic | plan_adherence | Did the agent follow its stated plan? |
| Agentic | plan_quality | Was the plan logical and effective? |
| Multi-Turn | turn_relevancy | Is each response relevant to its turn? |
| Multi-Turn | role_adherence | Does the agent maintain its role throughout? |
| Multi-Turn | knowledge_retention | Does the agent remember earlier context? |
| Multi-Turn | conversation_completeness | Was the conversation goal achieved? |
| Multi-Turn | goal_accuracy | How well did the agent achieve the user’s goal? |
| Multi-Turn | tool_use | Were tools used appropriately across turns? |
| Multi-Turn | topic_adherence | Did the agent stay on topic? |
| Multi-Turn | turn_faithfulness | Is each turn grounded in the provided context? |
| Multi-Turn | turn_contextual_precision | Context precision per turn |
| Multi-Turn | turn_contextual_recall | Context recall per turn |
| Multi-Turn | turn_contextual_relevancy | Context relevancy per turn |
| Safety | bias | Detects biased or discriminatory content |
| Safety | toxicity | Detects harmful, offensive, or toxic language |
| Safety | non_advice | Ensures no professional advice (legal, medical, financial) is given |
| Safety | misuse | Detects potential misuse or harmful instructions |
| Safety | pii_leakage | Checks for leaks of personally identifiable information |
| Safety | role_violation | Detects when the agent breaks character or role boundaries |
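
When scorers are chosen by category, it can help to keep the metric names from the table in one place. The following is a minimal sketch in plain Python, not part of the Playgent SDK: `METRICS_BY_TYPE` and `scorers_for` are hypothetical names introduced here for illustration, and only the metric names themselves come from the table above.

```python
# Metric names copied from the table above, grouped by type.
# This mapping is a local convenience, not a Playgent API.
METRICS_BY_TYPE = {
    "Custom": ["playval"],
    "RAG": [
        "answer_relevancy", "faithfulness", "contextual_precision",
        "contextual_recall", "contextual_relevancy",
    ],
    "Agentic": [
        "task_completion", "tool_correctness", "argument_correctness",
        "step_efficiency", "plan_adherence", "plan_quality",
    ],
    "Multi-Turn": [
        "turn_relevancy", "role_adherence", "knowledge_retention",
        "conversation_completeness", "goal_accuracy", "tool_use",
        "topic_adherence", "turn_faithfulness", "turn_contextual_precision",
        "turn_contextual_recall", "turn_contextual_relevancy",
    ],
    "Safety": [
        "bias", "toxicity", "non_advice", "misuse",
        "pii_leakage", "role_violation",
    ],
}


def scorers_for(*types):
    """Return the metric names for the given types, in table order."""
    names = []
    for metric_type in types:
        names.extend(METRICS_BY_TYPE[metric_type])
    return names


# Sanity check: the table lists 29 metrics in total.
total = sum(len(names) for names in METRICS_BY_TYPE.values())
print(total)
print(scorers_for("RAG", "Safety"))
```

The list returned by `scorers_for` could then be passed as the `scorers` argument shown in the examples on this page.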

Metric Requirements

| Type | Required Parameters |
| --- | --- |
| Custom | input, output, expected_behavior |
| RAG | input, output, context |
| Agentic | tools_called, expected_tools |
| Multi-Turn | conversation |
| Safety | output |
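
The requirements above can be checked locally before an evaluation is requested. This is a hedged sketch in plain Python: `REQUIRED_PARAMS` and `missing_params` are hypothetical helpers written for this page, not Playgent functions, and how the service itself reports missing parameters is not covered here.

```python
# Required parameters per metric type, copied from the table above.
REQUIRED_PARAMS = {
    "Custom": ("input", "output", "expected_behavior"),
    "RAG": ("input", "output", "context"),
    "Agentic": ("tools_called", "expected_tools"),
    "Multi-Turn": ("conversation",),
    "Safety": ("output",),
}


def missing_params(metric_type, **params):
    """Return the required parameter names that are absent or None."""
    return [
        name
        for name in REQUIRED_PARAMS[metric_type]
        if params.get(name) is None
    ]


# A RAG metric needs context; here it has not been supplied.
print(missing_params("RAG", input="What is your refund policy?",
                     output="Returns within 30 days for full refund."))

# A Safety metric only needs the output.
print(missing_params("Safety", output="Returns within 30 days for full refund."))
```

An empty list means every parameter listed for that type was provided.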

Quick Start

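# Run a test case, then evaluate the resulting run.
# `client` is assumed to be an already initialized Playgent client (setup not shown here).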
run = client.runs.create(test_case_id="tc_xyz789")

evaluation = client.evaluate(
    run_id=run.id,
    scorers=["answer_relevancy", "faithfulness", "bias"]
)

print(f"Overall pass: {evaluation.overall_pass}")
for scorer, result in evaluation.results.items():
    print(f"{scorer}: {result.score:.2f}")

Common Use Cases

Ad-hoc Evaluation

Evaluate without running a test:
evaluation = client.evaluate(
    input="What is your refund policy?",
    output="Returns within 30 days for full refund.",
    context=["Policy doc..."],
    scorers=["answer_relevancy", "faithfulness"]
)

Batch Evaluation

Evaluate multiple runs:
batch = client.evaluate_batch(
    run_ids=["run_001", "run_002", "run_003"],
    scorers=["answer_relevancy", "faithfulness"]
)

Custom Thresholds

Override the default threshold of 0.7:
evaluation = client.evaluate(
    run_id=run.id,
    scorers=["faithfulness"],
    thresholds={"faithfulness": 0.9}
)
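
To illustrate how a per-scorer override interacts with the default, here is a minimal sketch in plain Python. It assumes a scorer passes when its score is greater than or equal to its threshold and that the overall result passes only when every scorer passes; neither rule is stated on this page, so treat both as assumptions. The `passes` helper and the sample scores are hypothetical.

```python
DEFAULT_THRESHOLD = 0.7  # default value stated in the text above


def passes(scores, thresholds=None, default=DEFAULT_THRESHOLD):
    """Map each scorer to True when its score meets its threshold.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    thresholds = thresholds or {}
    return {
        name: score >= thresholds.get(name, default)
        for name, score in scores.items()
    }


# Made-up scores for illustration only.
scores = {"faithfulness": 0.85, "answer_relevancy": 0.85}

per_scorer = passes(scores, thresholds={"faithfulness": 0.9})
print(per_scorer)
# {'faithfulness': False, 'answer_relevancy': True}

# Assumption: the overall result requires every scorer to pass.
print(all(per_scorer.values()))
# False
```

With the same score of 0.85, `faithfulness` fails against its raised threshold of 0.9 while `answer_relevancy` passes against the default of 0.7.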

Automatic Evaluation

Set default scorers on an agent:
agent = client.agents.create(
    name="Support Agent",
    default_scorers=["answer_relevancy", "bias", "toxicity"]
)
# All runs of this agent are then evaluated automatically with these metrics

Next Steps

API Reference

Full evaluation API with all 29 metrics

Custom Scorers

Create domain-specific metrics