ComfyUI-Prompt-Calibrator/docs/CALIBRATION_POLICY.md

# Calibration policy — the agent's playbook

The local Qwen3-VL judge only **observes and scores** — it does not propose fixes. The
**external agent** (you / a stronger model) decides every correction. So the judge's job
is to hand the agent the *range of information needed to calibrate*, and the agent's job
is to turn that into prompt edits.

## What the agent needs from each comparison (the information model)

To move a generated image toward a reference, for **every dimension the prompt controls**
the agent needs three things:

| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state — what to change |
| `score` | 0–1 closeness | the **gap / priority** — which axes to fix first |

That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
(`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
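
As a rough illustration of how an agent might hold this signal in memory, the sketch below parses the per-axis entries and orders them by gap. The names `AxisReading`, `parse_axes`, and `rank_by_gap` are hypothetical helpers, not part of the calibrator, and the sample values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisReading:
    """One judge reading for a single axis (hypothetical container)."""
    score: float  # 0-1 closeness between generated and reference
    ref: str      # target: what the reference shows
    gen: str      # current: what the generated image shows


def parse_axes(report: dict) -> dict[str, AxisReading]:
    """Turn the report's `axes` mapping into typed readings."""
    return {
        name: AxisReading(float(axis["score"]), axis["ref"], axis["gen"])
        for name, axis in report["axes"].items()
    }


def rank_by_gap(axes: dict[str, AxisReading]) -> list[str]:
    """Axis names ordered from lowest score (fix first) to highest."""
    return sorted(axes, key=lambda name: axes[name].score)


# Illustrative report fragment in the documented shape.
sample_report = {
    "overall_score": 0.41,
    "axes": {
        "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"},
        "scene": {"score": 0.30, "ref": "dim bedroom", "gen": "bright kitchen"},
    },
}

readings = parse_axes(sample_report)
priority = rank_by_gap(readings)
```

Here `priority[0]` is the axis the agent would edit next, and `readings[priority[0]].ref` is the value to steer toward.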

The axes must **span what the prompt can express**: the agent can only correct what
prompt wording can change, so every reported difference must map to a prompt lever. The
default set (configurable on the node) is grouped below.

## Axes (default set — edit `axes` on the node to taste)

- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action (where explicit content concentrates):** `sexual_act`, `position`,
  `penetration`, `explicitness`, `body_contact`
- **Affect:** `pose`, `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`

Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
interaction cluster granular so the agent gets actionable targets.
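
One possible way to turn an axis state into a prompt string is sketched below. The real `render` step is left to the agent, so this function, its comma-joined output, and the `ORDER` list are assumptions for illustration only.

```python
def render(state: dict[str, str], order: list[str]) -> str:
    """Join the non-empty axis values into one prompt, following `order`.

    Assumption: the prompt is a comma-separated list of per-axis phrases.
    Axes missing from `state` or holding empty text are skipped.
    """
    parts = [state[axis].strip() for axis in order if state.get(axis, "").strip()]
    return ", ".join(parts)


# Illustrative subset of the Camera and Render groups.
ORDER = ["framing", "camera_angle", "scene", "lighting_color", "art_style"]

state = {
    "scene": "dimly lit bedroom",
    "lighting_color": "warm low-key lighting",
    "framing": "medium shot",
    "art_style": "",
}

prompt = render(state, ORDER)
```

Keeping one phrase per axis means a single-axis edit changes exactly one fragment of the prompt, which keeps score changes attributable.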

## Step 0 — first pass (describe / bootstrap)

The very first iteration has no generated image yet, so the judge runs in **describe
mode**: it looks at the reference alone and returns a prompt-ready `caption` plus a
per-axis target spec. That seeds everything:

```bash
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
  --run-tag seed --analysis-dir <report_dir>
```
→ `latest.json` = `{"mode":"describe", "caption":"...", "axes":{axis: "value", ...}}`

The agent takes `caption` as the **initial prompt** and `axes` as the **initial
axis_state**, then enters the compare loop below. No reference description has to be
written by hand — the VL provides the target to reproduce.
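
A minimal sketch of that seeding step, assuming the describe report has already been loaded into a dict with the shape shown above. The helper name `seed_from_describe` and the sample values are hypothetical.

```python
def seed_from_describe(report: dict) -> tuple[str, dict[str, str]]:
    """Return (initial_prompt, initial_axis_state) from a describe-mode report."""
    if report.get("mode") != "describe":
        raise ValueError("expected a describe-mode report")
    # Copy the axes so later edits do not mutate the loaded report.
    return report["caption"], dict(report["axes"])


# Illustrative describe output in the documented shape.
describe_report = {
    "mode": "describe",
    "caption": "a dim bedroom lit by a warm lamp",
    "axes": {"scene": "dim bedroom", "lighting_color": "warm low-key"},
}

initial_prompt, axis_state = seed_from_describe(describe_report)
```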

## Per-iteration algorithm (greedy per-axis hill-climb)

```
best_score = -1 ; best_state = initial_state ; best_report = none ; stale = 0 ; i = 0
state = initial_state
loop:
  i += 1
  prompt = render(state)                       # state = current value per axis
  report = run agent_bridge.py --prompt prompt --negative state.negative
                               --seed state.seed --run-tag iter{i}
                               --workflow wf.json --analysis-dir <report_dir>
  if report.overall_score >= TARGET: stop("converged", state)         # e.g. 0.85
  if report.overall_score > best_score:
      best_score = report.overall_score ; best_state = state
      best_report = report ; stale = 0
  else:
      stale += 1 ; state = best_state          # revert the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)

  # read axes from the report that belongs to best_state, not from a reverted run
  worst = axis with the lowest best_report.axes[*].score
  target_value = best_report.axes[worst].ref    # what the reference shows
  state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
```

`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. `gen:[missionary] → ref:[doggy style]` ⇒ set the position
phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
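
The loop can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `run_iteration` and `edit_toward` are hypothetical callables supplied by the agent, the state is a plain dict, and the worst axis is read from the report that belongs to the best state so far.

```python
from typing import Callable

TARGET, PATIENCE, MAX_ITERS = 0.85, 4, 25


def hill_climb(
    initial_state: dict[str, str],
    run_iteration: Callable[[dict[str, str], int], dict],
    edit_toward: Callable[[str, str, int], str],
    target: float = TARGET,
    patience: int = PATIENCE,
    max_iters: int = MAX_ITERS,
) -> tuple[str, dict[str, str], float]:
    """Greedy per-axis hill-climb; returns (stop_reason, best_state, best_score)."""
    state = dict(initial_state)
    best_state, best_report, best_score = dict(state), None, -1.0
    stale = 0
    for i in range(1, max_iters + 1):
        report = run_iteration(state, i)          # render + generate + judge
        score = report["overall_score"]
        if score > best_score:                    # keep the change
            best_state, best_report, best_score = dict(state), report, score
            stale = 0
        else:                                     # change did not help
            stale += 1
        if best_score >= target:
            return "converged", best_state, best_score
        if stale >= patience:
            return "plateau", best_state, best_score
        axes = best_report["axes"]
        worst = min(axes, key=lambda name: axes[name]["score"])
        state = dict(best_state)                  # always edit from the best state
        # `stale` lets the caller try alternative wording on a retry.
        state[worst] = edit_toward(worst, axes[worst]["ref"], stale)
    return "budget", best_state, best_score
```

Passing `stale` to `edit_toward` is one way to let the agent pick a different phrasing after a reverted attempt; any other retry signal would work equally well.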

### Rules that matter

1. **Change one axis per iteration** — clean attribution of the score delta. Batch two
   only when both are very low and clearly independent.
2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
   noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`**, never from a worse last state.
4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the
   score, try an alternative phrasing for that axis before moving on.
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
   orderings; if two candidates are within ~0.03, re-run each on a second seed.
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
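
Rules 5 and 6 could be supported by two small helpers like the ones below. Both are hypothetical: the log entry layout simply mirrors the tuple in rule 6, and `MARGIN` takes the ~0.03 figure from rule 5 as a default.

```python
MARGIN = 0.03  # rule 5: candidates this close are treated as tied


def needs_second_seed(score_a: float, score_b: float, margin: float = MARGIN) -> bool:
    """True when two candidate scores are too close to trust a single reading."""
    return abs(score_a - score_b) <= margin


def log_step(log: list, iteration: int, axis_changed: str, old: str, new: str,
             overall_score: float, axes: dict, n_worst: int = 3) -> dict:
    """Append one step record and return it.

    `axes` is the report's per-axis mapping; the record keeps the
    `n_worst` lowest-scoring axis names.
    """
    worst = sorted(axes, key=lambda name: axes[name]["score"])[:n_worst]
    entry = {
        "iter": iteration,
        "axis_changed": axis_changed,
        "old": old,
        "new": new,
        "overall_score": overall_score,
        "worst_axes": worst,
    }
    log.append(entry)
    return entry


history: list = []
step = log_step(
    history, 1, "scene", "bright kitchen", "dimly lit bedroom", 0.41,
    {"scene": {"score": 0.30}, "lighting_color": {"score": 0.50}},
)
```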

## Worked example

```
iter1  overall=0.41   worst: scene 0.30  ref:[dim bedroom]   gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
iter2  overall=0.58   worst: position 0.35  ref:[doggy style] gen:[missionary]
       edit position → "doggy style"
iter3  overall=0.71   worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
       edit lighting → "warm low-key lighting"   (0.69 → revert)
iter4  overall=0.69   retry lighting → "warm golden low-key glow"   (0.84 → keep)
iter5  overall=0.84   worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
       edit clothing → "red lace lingerie"
iter6  overall=0.89   ≥ target → STOP
```

## Report shape the agent reads (`latest.json` / stdout)

```json
{
  "run_tag": "iter002",
  "overall_score": 0.58,
  "axes": {
    "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
    "scene":    {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
```
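
Before acting on a report, the agent may want to check that it has the expected shape. The sketch below assumes `latest.json` sits directly inside the analysis directory; that location, and the helper names `validate_report` and `load_report`, are assumptions rather than documented behaviour.

```python
import json
from pathlib import Path

REQUIRED_AXIS_KEYS = {"score", "ref", "gen"}


def validate_report(report: dict) -> dict:
    """Raise ValueError if a compare report is missing fields or has scores outside 0-1."""
    overall = report.get("overall_score")
    if not isinstance(overall, (int, float)) or not 0.0 <= overall <= 1.0:
        raise ValueError("overall_score must be a number in [0, 1]")
    for name, axis in report.get("axes", {}).items():
        missing = REQUIRED_AXIS_KEYS - set(axis)
        if missing:
            raise ValueError(f"axis {name!r} is missing {sorted(missing)}")
        if not 0.0 <= axis["score"] <= 1.0:
            raise ValueError(f"axis {name!r} has an out-of-range score")
    return report


def load_report(analysis_dir: str) -> dict:
    """Read and validate latest.json (assumed to live in `analysis_dir`)."""
    path = Path(analysis_dir) / "latest.json"
    return validate_report(json.loads(path.read_text(encoding="utf-8")))
```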

## Agent system prompt (paste into your CLI agent)

> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
> body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1
> closeness), `ref` (what the reference shows) and `gen` (what the generated image shows).
>
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`;
> (3) read the printed JSON.
>
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
> wording to match its `ref` value (you decide the wording — there are no machine-supplied
> fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at
> `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
> Log every step and report the best prompt + score.
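
For an agent that drives the bridge from Python rather than a shell, step (2) might look like the sketch below. It uses only the flags named in this document; the helper names are hypothetical, and `run_once` assumes stdout contains nothing but the report JSON, which may not hold in practice.

```python
import json
import subprocess
import sys


def build_command(workflow: str, prompt: str, negative: str, seed: int,
                  iteration: int, analysis_dir: str,
                  script: str = "agent_bridge.py") -> list[str]:
    """Assemble the argv list for one compare run (no shell quoting needed)."""
    return [
        sys.executable, script,
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", analysis_dir,
    ]


def run_once(cmd: list[str]) -> dict:
    """Run the bridge and parse its printed JSON.

    Assumption: stdout is exactly one JSON document.
    """
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


cmd = build_command("wf.json", "dimly lit bedroom", "blurry", 42, 3, "reports")
```

Passing the arguments as a list avoids shell-escaping problems when the prompt contains quotes or commas.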