ComfyUI-Prompt-Calibrator/docs/CALIBRATION_POLICY.md
Ethanfel c7ef756a71 Add describe (first-pass) mode to the judge node
New mode on QwenVLImageJudge: 'describe' looks at the reference alone and returns
a prompt-ready caption + per-axis target spec to seed the very first prompt (the
generator has nothing to reproduce yet). 'compare' is the existing ref-vs-gen
scoring. generated_image is now optional (required only for compare); shared
generation refactored into _generate_from_messages; third output renamed
diff_analysis -> analysis (mode-agnostic). agent_bridge gains --mode (describe
needs no receptor/prompt); added workflow_describe_api.json. Docs updated with the
first-pass bootstrap step. Fixed error-return arity to 5-tuple.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 23:04:09 +02:00

# Calibration policy — the agent's playbook
The local Qwen3-VL judge only **observes and scores** — it does not propose fixes. The
**external agent** (you / a stronger model) decides every correction. So the judge's job
is to hand the agent the *range of information needed to calibrate*, and the agent's job
is to turn that into prompt edits.
## What the agent needs from each comparison (the information model)
To move a generated image toward a reference, for **every dimension the prompt controls**
the agent needs three things:
| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target**: what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state: what to change |
| `score` | 0–1 closeness | the **gap / priority**: which axes to fix first |
That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
(`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
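A minimal sketch of how an agent might order axes from that signal, assuming the report has already been parsed into a dict of the per-axis shape described above. The helper name `rank_axes` and the sample values are illustrative assumptions, not part of the repo.

```python
from typing import Any, Dict, List, Tuple


def rank_axes(axes: Dict[str, Dict[str, Any]]) -> List[Tuple[str, float, str, str]]:
    """Return (axis, score, ref, gen) tuples, lowest score first."""
    rows = [
        (name, float(entry["score"]), str(entry["ref"]), str(entry["gen"]))
        for name, entry in axes.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical sample values, for illustration only.
sample_axes = {
    "scene": {"score": 0.30, "ref": "dim bedroom", "gen": "bright kitchen"},
    "framing": {"score": 0.90, "ref": "medium shot", "gen": "medium shot"},
    "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"},
}

ranked = rank_axes(sample_axes)
# The first row is the axis to fix next: its ref is the target, its gen the current state.
worst_axis, worst_score, target, current = ranked[0]
```

Sorting the whole list rather than taking only the minimum keeps the runner-up axes available for the step log.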
The axes must **span what the prompt can express** — you can only fix what the prompt can
say, and each diff must map to a lever. The default set (configurable on the node) is
grouped below.
## Axes (default set — edit `axes` on the node to taste)
- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action (where explicit content concentrates):** `sexual_act`, `position`,
`penetration`, `explicitness`, `body_contact`
- **Affect:** `pose`, `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`
Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
interaction cluster granular so the agent gets actionable targets.
## Step 0 — first pass (describe / bootstrap)
The very first iteration has no generated image yet, so the judge runs in **describe
mode**: it looks at the reference alone and returns a prompt-ready `caption` plus a
per-axis target spec. That seeds everything:
```bash
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
    --run-tag seed --analysis-dir <report_dir>
```
`latest.json` = `{"mode":"describe", "caption":"...", "axes":{axis: "value", ...}}`
The agent takes `caption` as the **initial prompt** and `axes` as the **initial
axis_state**, then enters the compare loop below. No reference description has to be
written by hand — the VL provides the target to reproduce.
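As a rough sketch, seeding the loop from a describe-mode report could look like the following. It assumes the report text has the shape shown above; the helper name `seed_from_describe` and the sample caption are hypothetical.

```python
import json
from typing import Dict, Tuple


def seed_from_describe(report_text: str) -> Tuple[str, Dict[str, str]]:
    """Return (initial_prompt, initial_axis_state) from a describe-mode report."""
    report = json.loads(report_text)
    if report.get("mode") != "describe":
        raise ValueError("expected a describe-mode report")
    # caption becomes the first prompt; axes becomes the first axis state.
    return report["caption"], dict(report["axes"])


# Hypothetical report contents, for illustration only.
sample = json.dumps({
    "mode": "describe",
    "caption": "a woman reading in a dim bedroom, warm low-key lighting",
    "axes": {"scene": "dim bedroom", "lighting_color": "warm low-key"},
})
initial_prompt, axis_state = seed_from_describe(sample)
```

In practice the text would be read from `latest.json` in the analysis directory rather than built inline.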
## Per-iteration algorithm (greedy per-axis hill-climb)
```
state = initial_state ; best_state = state ; best_score = -1 ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)                  # state = current value per axis
    report = run agent_bridge.py --prompt prompt --negative state.negative
                 --seed state.seed --run-tag iter{i}
                 --workflow wf.json --analysis-dir <report_dir>
    if report.overall_score >= TARGET: stop("converged", state)   # e.g. 0.85
    if report.overall_score > best_score:
        best_score = report.overall_score ; best_state = state ; stale = 0
    else:
        stale += 1 ; state = best_state     # revert the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
    worst = axis with the lowest report.axes[*].score
    target_value = report.axes[worst].ref   # what the reference shows
    state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
```
`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. `gen:[missionary] → ref:[doggy style]` ⇒ set the position
phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
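The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: `evaluate` stands in for rendering the state, running `agent_bridge.py`, and parsing the report; `edit_toward` here also receives the axis name; and `fake_evaluate` is a toy scorer that exists only so the sketch runs end to end.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, str]
Report = Dict[str, Any]


def hill_climb(
    initial_state: State,
    evaluate: Callable[[State], Report],
    edit_toward: Callable[[str, str], str],
    target: float = 0.85,
    patience: int = 4,
    max_iters: int = 25,
) -> Tuple[str, State, float]:
    """Greedy per-axis hill-climb; returns (stop_reason, state, score)."""
    best_score, best_state = -1.0, dict(initial_state)
    state, stale = dict(initial_state), 0
    for _ in range(max_iters):
        report = evaluate(state)
        score = float(report["overall_score"])
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_score, best_state, stale = score, dict(state), 0
        else:
            stale += 1  # the last edit did not help; it is dropped below
        if stale >= patience:
            return "plateau", best_state, best_score
        axes = report["axes"]
        worst = min(axes, key=lambda name: axes[name]["score"])
        # Always edit from best_state, changing exactly one axis.
        state = dict(best_state)
        state[worst] = edit_toward(worst, axes[worst]["ref"])
    return "budget", best_state, best_score


# Toy stand-in for the judge: an axis scores 1.0 on an exact match, else 0.2.
REFERENCE = {"scene": "dim bedroom", "lighting_color": "warm low-key", "framing": "medium shot"}


def fake_evaluate(state: State) -> Report:
    axes = {
        name: {"score": 1.0 if state.get(name) == ref else 0.2,
               "ref": ref, "gen": state.get(name, "")}
        for name, ref in REFERENCE.items()
    }
    overall = sum(entry["score"] for entry in axes.values()) / len(axes)
    return {"overall_score": overall, "axes": axes}


reason, final_state, final_score = hill_climb(
    {"scene": "bright kitchen", "lighting_color": "flat daylight", "framing": "medium shot"},
    fake_evaluate,
    edit_toward=lambda axis, ref: ref,  # toy: copy the reference wording verbatim
)
```

With a real judge, `edit_toward` is where the agent's own wording decisions go; copying `ref` verbatim only works here because the toy scorer rewards exact matches.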
### Rules that matter
1. **Change one axis per iteration** — clean attribution of the score delta. Batch two
only when both are very low and clearly independent.
2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`**, never from a worse last state.
4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the score
after a try, try an alternative phrasing for that axis before moving on.
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
orderings; if two candidates are within ~0.03, re-run each on a second seed.
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
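Rules 5 and 6 are easy to mechanize. The sketch below assumes a JSON-lines log; the helper names `near_tie` and `log_step`, and the record field names, are illustrative choices rather than anything the repo defines.

```python
import json
from typing import List


def near_tie(score_a: float, score_b: float, margin: float = 0.03) -> bool:
    """True when two candidates are too close to rank from a single reading."""
    return abs(score_a - score_b) <= margin


def log_step(iteration: int, axis_changed: str, old: str, new: str,
             overall_score: float, worst_axes: List[str]) -> str:
    """Serialize one calibration step as a single JSON line."""
    return json.dumps({
        "iter": iteration,
        "axis_changed": axis_changed,
        "old": old,
        "new": new,
        "overall_score": overall_score,
        "worst_axes": worst_axes,
    })


# Hypothetical step, for illustration only.
line = log_step(1, "scene", "bright kitchen", "dimly lit bedroom", 0.41,
                ["scene", "lighting_color"])
rerun_needed = near_tie(0.69, 0.71)  # within the margin: re-run both on a second seed
```

One record per line keeps the log appendable and easy to filter after a run.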
## Worked example
```
iter1 overall=0.41  worst: scene 0.30            ref:[dim bedroom]       gen:[bright kitchen]
      edit scene → "dimly lit bedroom"
iter2 overall=0.58  worst: position 0.35         ref:[doggy style]       gen:[missionary]
      edit position → "doggy style"
iter3 overall=0.71  worst: lighting_color 0.50   ref:[warm low-key]      gen:[flat daylight]
      edit lighting → "warm low-key lighting"            (0.69 → revert)
iter4 overall=0.69  retry lighting → "warm golden low-key glow"          (0.84 → keep)
iter5 overall=0.84  worst: clothing_state 0.80   ref:[red lace lingerie] gen:[plain bra]
      edit clothing → "red lace lingerie"
iter6 overall=0.89  ≥ target → STOP
```
## Report shape the agent reads (`latest.json` / stdout)
```json
{
  "run_tag": "iter002",
  "overall_score": 0.58,
  "axes": {
    "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
    "scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
```
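Before acting on a report, the agent may want to check that it has the shape shown above. The following is a minimal sketch; the helper name `report_problems` and the specific checks are assumptions, not a schema the project publishes.

```python
from typing import Any, Dict, List

REQUIRED_AXIS_FIELDS = {"score", "ref", "gen"}


def report_problems(report: Dict[str, Any]) -> List[str]:
    """List shape problems in a compare report; an empty list means it looks usable."""
    problems: List[str] = []
    overall = report.get("overall_score")
    if not isinstance(overall, (int, float)) or not 0.0 <= overall <= 1.0:
        problems.append("overall_score missing or outside 0..1")
    axes = report.get("axes")
    if not isinstance(axes, dict) or not axes:
        problems.append("axes missing or empty")
        return problems
    for name, entry in axes.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        missing = sorted(REQUIRED_AXIS_FIELDS - set(entry))
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
    return problems


# Hypothetical reports, for illustration only.
good = {
    "run_tag": "iter002",
    "overall_score": 0.58,
    "axes": {"scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}},
}
bad = {"overall_score": 0.58, "axes": {"scene": {"score": 0.92, "ref": "dim bedroom"}}}

good_result = report_problems(good)
bad_result = report_problems(bad)
```

Returning a list of problems rather than raising lets the agent log a malformed report and retry the run.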
## Agent system prompt (paste into your CLI agent)
> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
> body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1
> closeness), `ref` (what the reference shows) and `gen` (what the generated shows).
>
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`;
> (3) read the printed JSON.
>
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
> wording to match its `ref` value (you decide the wording — there are no machine-supplied
> fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at
> `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
> Log every step and report the best prompt + score.