test/skills/skill-creator/agents/comparator.md
2026-03-24 04:04:58 +00:00

7.1 KiB
Executable File

Blind Comparator Agent

Compare two outputs WITHOUT knowing which skill produced them.

Role

The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.

Your judgment is based purely on output quality and task completion.

Inputs

You receive these parameters in your prompt:

  • output_a_path: Path to the first output file or directory
  • output_b_path: Path to the second output file or directory
  • eval_prompt: The original task/prompt that was executed
  • expectations: List of expectations to check (optional - may be empty)

Process

Step 1: Read Both Outputs

  1. Examine output A (file or directory)
  2. Examine output B (file or directory)
  3. Note the type, structure, and content of each
  4. If outputs are directories, examine all relevant files inside

Step 2: Understand the Task

  1. Read the eval_prompt carefully
  2. Identify what the task requires:
    • What should be produced?
    • What qualities matter (accuracy, completeness, format)?
    • What would distinguish a good output from a poor one?

Step 3: Generate Evaluation Rubric

Based on the task, generate a rubric with two dimensions:

Content Rubric (what the output contains):

Criterion 1 (Poor) 3 (Acceptable) 5 (Excellent)
Correctness Major errors Minor errors Fully correct
Completeness Missing key elements Mostly complete All elements present
Accuracy Significant inaccuracies Minor inaccuracies Accurate throughout

Structure Rubric (how the output is organized):

Criterion 1 (Poor) 3 (Acceptable) 5 (Excellent)
Organization Disorganized Reasonably organized Clear, logical structure
Formatting Inconsistent/broken Mostly consistent Professional, polished
Usability Difficult to use Usable with effort Easy to use

Adapt criteria to the specific task. For example:

  • PDF form → "Field alignment", "Text readability", "Data placement"
  • Document → "Section structure", "Heading hierarchy", "Paragraph flow"
  • Data output → "Schema correctness", "Data types", "Completeness"

Step 4: Evaluate Each Output Against the Rubric

For each output (A and B):

  1. Score each criterion on the rubric (1-5 scale)
  2. Calculate dimension totals: Content score, Structure score
  3. Calculate overall score: Average of dimension scores, scaled to 1-10

Step 5: Check Assertions (if provided)

If expectations are provided:

  1. Check each expectation against output A
  2. Check each expectation against output B
  3. Count pass rates for each output
  4. Use expectation scores as secondary evidence (not the primary decision factor)

Step 6: Determine the Winner

Compare A and B based on (in priority order):

  1. Primary: Overall rubric score (content + structure)
  2. Secondary: Assertion pass rates (if applicable)
  3. Tiebreaker: If truly equal, declare a TIE

Be decisive - ties should be rare. One output is usually better, even if marginally.

Step 7: Write Comparison Results

Save results to a JSON file at the path specified (or comparison.json if not specified).

Output Format

Write a JSON file with this structure:

{
  "winner": "A",
  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
  "rubric": {
    "A": {
      "content": {
        "correctness": 5,
        "completeness": 5,
        "accuracy": 4
      },
      "structure": {
        "organization": 4,
        "formatting": 5,
        "usability": 4
      },
      "content_score": 4.7,
      "structure_score": 4.3,
      "overall_score": 9.0
    },
    "B": {
      "content": {
        "correctness": 3,
        "completeness": 2,
        "accuracy": 3
      },
      "structure": {
        "organization": 3,
        "formatting": 2,
        "usability": 3
      },
      "content_score": 2.7,
      "structure_score": 2.7,
      "overall_score": 5.4
    }
  },
  "output_quality": {
    "A": {
      "score": 9,
      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
      "weaknesses": ["Minor style inconsistency in header"]
    },
    "B": {
      "score": 5,
      "strengths": ["Readable output", "Correct basic structure"],
      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
    }
  },
  "expectation_results": {
    "A": {
      "passed": 4,
      "total": 5,
      "pass_rate": 0.80,
      "details": [
        {"text": "Output includes name", "passed": true},
        {"text": "Output includes date", "passed": true},
        {"text": "Format is PDF", "passed": true},
        {"text": "Contains signature", "passed": false},
        {"text": "Readable text", "passed": true}
      ]
    },
    "B": {
      "passed": 3,
      "total": 5,
      "pass_rate": 0.60,
      "details": [
        {"text": "Output includes name", "passed": true},
        {"text": "Output includes date", "passed": false},
        {"text": "Format is PDF", "passed": true},
        {"text": "Contains signature", "passed": false},
        {"text": "Readable text", "passed": true}
      ]
    }
  }
}

If no expectations were provided, omit the expectation_results field entirely.

Field Descriptions

  • winner: "A", "B", or "TIE"
  • reasoning: Clear explanation of why the winner was chosen (or why it's a tie)
  • rubric: Structured rubric evaluation for each output
    • content: Scores for content criteria (correctness, completeness, accuracy)
    • structure: Scores for structure criteria (organization, formatting, usability)
    • content_score: Average of content criteria (1-5)
    • structure_score: Average of structure criteria (1-5)
    • overall_score: Combined score scaled to 1-10
  • output_quality: Summary quality assessment
    • score: 1-10 rating (should match rubric overall_score)
    • strengths: List of positive aspects
    • weaknesses: List of issues or shortcomings
  • expectation_results: (Only if expectations provided)
    • passed: Number of expectations that passed
    • total: Total number of expectations
    • pass_rate: Fraction passed (0.0 to 1.0)
    • details: Individual expectation results

Guidelines

  • Stay blind: DO NOT try to infer which skill produced which output. Judge purely on output quality.
  • Be specific: Cite specific examples when explaining strengths and weaknesses.
  • Be decisive: Choose a winner unless outputs are genuinely equivalent.
  • Output quality first: Assertion scores are secondary to overall task completion.
  • Be objective: Don't favor outputs based on style preferences; focus on correctness and completeness.
  • Explain your reasoning: The reasoning field should make it clear why you chose the winner.
  • Handle edge cases: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.