ComfyUI-Prompt-Calibrator/docs/CALIBRATION_POLICY.md
Ethanfel 95198a15b5 Initial commit: VLM-as-judge prompt calibration loop
Qwen3-VL image-similarity judge node, external-prompt receptor node,
agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/
calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:15:56 +02:00

# Calibration policy — the agent's playbook
This is the instruction set the **external CLI agent** (the controller) follows each
iteration. Paste the "Agent system prompt" block into your agent, give it the workflow
path + reference image + target score, and let it loop.
The agent calibrates by reasoning over the **PromptBuilder axes** and editing a
structured *axis state*, then **rendering that state to a prompt string** that it injects
into the `CalibratorPromptReceptor`. This keeps the reasoning axis-aware while staying
compatible with the flat-string receptor. (If you later switch the receptor to carry a
structured config, the same axis state maps straight onto PromptBuilder's split control
nodes.)
---
## Axis state (the agent's working memory)
```json
{
"cast": "1 woman, mid-20s, athletic",
"clothing": "red lace lingerie",
"pose": "standing, hand on hip",
"scene": "dimly lit bedroom",
"composition": "full-body shot, slight low angle",
"expression": "soft smile, eye contact",
"color_light": "warm rim light, shallow depth of field",
"quality": "photorealistic, high detail",
"negative": "blurry, deformed, lowres, extra limbs",
"seed": 12345
}
```
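The first seven keys (`cast` through `color_light`) are exactly the Judge's scoring axes.
`quality`, `negative`, and `seed` are carried but not scored; `negative` and `seed` are
passed to the bridge as separate flags and never enter the prompt string. Render order
(subject → wardrobe → action → setting → framing → affect → light → quality):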
```
prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
```
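The render step can be sketched in Python as follows. This is a minimal sketch, not the repository's implementation: the `", "` separator and the rule that whitespace-only values count as empty are assumptions, and `RENDER_ORDER` is an illustrative name.

```python
# Axis keys in the documented render order; negative and seed are deliberately absent.
RENDER_ORDER = ["cast", "clothing", "pose", "scene", "composition",
                "expression", "color_light", "quality"]


def join_nonempty(parts, sep=", "):
    """Join the parts that are non-empty after stripping whitespace.

    The ", " separator is an assumption; the policy does not specify one.
    """
    return sep.join(p.strip() for p in parts if p and p.strip())


def render(state):
    """Render an axis-state dict to a flat prompt string."""
    return join_nonempty(state.get(key, "") for key in RENDER_ORDER)
```

Because empty axes are skipped, the agent can clear an axis by setting it to `""` without leaving a stray separator in the prompt.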
---
## Per-iteration algorithm (greedy per-axis hill-climb)
```
best_score = -1 ; best_state = initial_state ; state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)
    report = run agent_bridge.py --prompt prompt --negative state.negative
                                 --seed state.seed --run-tag iter{i}
                                 --workflow wf.json --analysis-dir <report_dir>
    score = report.overall_score
    if score >= TARGET:                        # e.g. 0.85
        stop("converged", state, score)
    if score > best_score:
        best_score = score ; best_state = state ; stale = 0
    else:
        stale += 1
        state = best_state                     # revert: undo the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS:    # e.g. PATIENCE=4, MAX_ITERS=25
        stop("plateau/budget", best_state, best_score)
    # choose the next single edit:
    worst_axis = axis with lowest per-axis score in report.axes
    edit = map_fix_to_axis(report.fix_suggestions, worst_axis)   # apply the model's suggestion
    state = apply(best_state, worst_axis, edit)                  # change ONE axis only
```
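For readers who want something executable, the loop above might look like this in Python. It is a sketch under stated assumptions: `evaluate` and `propose_edit` are hypothetical callables supplied by the caller (the first would wrap the `agent_bridge.py` call, the second would turn a fix suggestion into a new axis value), and the report is assumed to be a dict with `overall_score` and an `axes` mapping of axis name to score.

```python
SCORED_AXES = ["cast", "clothing", "pose", "scene",
               "composition", "expression", "color_light"]


def calibrate(initial_state, evaluate, propose_edit,
              target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb over an axis-state dict.

    evaluate(state, i)               -> report dict (hypothetical callable)
    propose_edit(report, axis, best) -> new value for that axis (hypothetical)
    Returns (status, state, score, history).
    """
    state = dict(initial_state)
    best_state, best_score, stale = state, -1.0, 0
    history = []  # (iteration, overall_score) pairs for auditing
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)
        score = report["overall_score"]
        history.append((i, score))
        if score >= target:
            return "converged", state, score, history
        if score > best_score:
            best_state, best_score, stale = state, score, 0
        else:
            stale += 1  # the candidate is dropped; the next edit starts from best_state
        if stale >= patience:
            break
        # Pick the lowest-scoring axis in the latest report and change only that axis.
        worst = min(SCORED_AXES, key=lambda axis: report["axes"][axis])
        state = dict(best_state)
        state[worst] = propose_edit(report, worst, best_state)
    return "plateau/budget", best_state, best_score, history
```

Each candidate is a fresh copy of `best_state`, so a rejected edit never leaks into later iterations, and `seed` stays whatever the initial state carried.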
### Rules that matter
1. **Change one axis per iteration.** One edit = clean attribution of the score delta.
Only batch two edits when two axes score very low *and* are clearly independent.
2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not
sampler noise. Vary the seed only after you've converged, to confirm robustness.
3. **Always edit from `best_state`, not the last (possibly worse) state.** This is the
"revert on no improvement" step; it prevents drifting down a bad path.
4. **Target the lowest-scoring axis first**, applying the Judge's matching
`fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value
for that axis before moving on.
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
orderings; if two candidates are within ~0.03, re-run each on a second seed and compare
averages before committing.
6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size
(smaller, more specific wording changes) and re-anchor on `best_state`.
7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per-axis)`.
The run must be auditable and resumable.
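Rule 5 is the only rule with arithmetic in it, so a small sketch may help. It assumes a hypothetical `rescore(state, seed)` callable that renders the state, runs the bridge on the given seed, and returns `overall_score`; the 0.03 margin comes from the rule, everything else is illustrative.

```python
from statistics import mean

MARGIN = 0.03  # candidates at most this far apart are treated as a near tie


def pick_candidate(state_a, score_a, state_b, score_b, rescore, second_seed):
    """Return the better of two candidates.

    When the first-seed scores are within MARGIN, both candidates are
    re-scored on a second seed and the two-seed averages are compared.
    rescore(state, seed) -> overall_score is a hypothetical callable.
    """
    if abs(score_a - score_b) > MARGIN:
        # Clear winner: no extra renders needed.
        return state_a if score_a > score_b else state_b
    avg_a = mean([score_a, rescore(state_a, second_seed)])
    avg_b = mean([score_b, rescore(state_b, second_seed)])
    return state_a if avg_a >= avg_b else state_b
```

The extra renders are only spent when the first reading cannot separate the candidates, which keeps the iteration budget for actual axis edits.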
### Mapping `fix_suggestions` → axes
The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to
clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous,
attribute it to the lowest-scoring axis it plausibly affects.
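A keyword matcher along these lines could look like the sketch below. The keyword table is entirely hypothetical (the Judge's real vocabulary is not documented here), and `axis_for_fix` is an illustrative helper, not the `map_fix_to_axis` used in the pseudocode.

```python
# Hypothetical keyword table: tune it to the phrases your Judge actually emits.
AXIS_KEYWORDS = {
    "cast": ["cast", "subject", "person"],
    "clothing": ["clothing", "outfit", "wearing"],
    "pose": ["pose", "standing", "seated"],
    "scene": ["scene", "background", "room"],
    "composition": ["composition", "framing", "angle", "shot"],
    "expression": ["expression", "smile", "gaze"],
    "color_light": ["color", "light", "warm"],
}


def axis_for_fix(fix, axis_scores):
    """Return the axis a fix suggestion most plausibly targets, or None.

    Matching is by case-insensitive substring. If several axes match,
    the fix is attributed to the lowest-scoring one among them.
    """
    text = fix.lower()
    matches = [axis for axis, words in AXIS_KEYWORDS.items()
               if any(word in text for word in words)]
    if not matches:
        return None
    return min(matches, key=lambda axis: axis_scores[axis])
```

For example, "warmer background" touches both `scene` and `color_light`; with `scene` at 0.30 and `color_light` at 0.50, the sketch attributes it to `scene`.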
---
## Worked example
```
iter1 prompt="1 woman, casual outfit, indoors, ..." score=0.41
axes: scene 0.30 (worst) — "ref bedroom, gen kitchen"
fix: "set scene to a dim bedroom"
iter2 edit scene→"dimly lit bedroom" score=0.58 (kept)
axes: pose 0.35 (worst) — "ref standing, gen seated"
iter3 edit pose→"standing, hand on hip" score=0.71 (kept)
axes: color_light 0.50 (worst) — "ref warm, gen flat"
iter4 edit color_light→"warm rim light" score=0.69 (worse → revert)
iter5 edit color_light→"warm golden hour glow" score=0.83 (kept)
axes: clothing 0.78 (worst) — "gen lacks lace detail"
iter6 edit clothing→"red lace lingerie with trim" score=0.88 ≥ target → STOP
```
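A trace like this is what rule 7 asks the agent to log. One way to keep it auditable and resumable is an append-only JSON Lines file, sketched below; the file layout, field names, and helper names are assumptions for illustration, not something the repository prescribes.

```python
import json


def log_step(path, iteration, axis_changed, old_value, new_value,
             prompt, overall_score, per_axis):
    """Append one iteration record as a single JSON line."""
    record = {
        "iter": iteration,
        "axis_changed": axis_changed,
        "old": old_value,
        "new": new_value,
        "prompt": prompt,
        "overall_score": overall_score,
        "per_axis": per_axis,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_steps(path):
    """Read all logged records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def best_step(steps):
    """Return the record with the highest overall_score, or None if empty."""
    return max(steps, key=lambda s: s["overall_score"], default=None)
```

On restart, `best_step(load_steps(path))` gives the iteration to resume from, provided each record's prompt (or the full axis state, if you choose to log it) is enough to rebuild `best_state`.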
---
## Agent system prompt (paste into your CLI agent)
> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference image, measured by a Qwen3-VL judge that scores 7 axes
> (cast, clothing, pose, scene, composition, expression, color_light) from 0 to 1.
>
> You hold an **axis state** (JSON, keys above). Each turn you: (1) render the state to a
> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→
> quality; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir
> <report_dir>`; (3) read the printed JSON report.
>
> Then apply greedy per-axis hill-climbing: keep the change only if `overall_score` improved,
> else revert to the best state; pick the **lowest-scoring axis** and apply the Judge's
> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching.
> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non-improving
> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
>
> Never change more than one axis at a time unless two axes are both very low and clearly
> independent. Never trust a single near-margin reading: re-run on a second seed when two
> candidates are within 0.03.
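---
If the controller is a script rather than a chat agent, the bridge command from step (2) can be assembled as an argument list. The flags below are the ones named in this document; the function name, the `python` executable default, and the script's location in the working directory are assumptions.

```python
def bridge_argv(workflow, prompt, negative, seed, iteration, analysis_dir,
                python="python"):
    """Build the agent_bridge.py command line as a list of arguments.

    Using a list (not a shell string) means prompts containing quotes or
    commas need no escaping when handed to subprocess.run.
    """
    return [
        python, "agent_bridge.py",
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", analysis_dir,
    ]
```

The list can then go to `subprocess.run(argv, capture_output=True, text=True, check=True)`. Parsing the result with `json.loads` assumes the bridge prints only the JSON report on stdout, which this document implies but does not guarantee.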