ComfyUI-Prompt-Calibrator/docs/CALIBRATION_POLICY.md
Ethanfel 95198a15b5 Initial commit: VLM-as-judge prompt calibration loop
Qwen3-VL image-similarity judge node, external-prompt receptor node,
agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/
calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:15:56 +02:00


Calibration policy — the agent's playbook

This is the instruction set the external CLI agent (the controller) follows each iteration. Paste the "Agent system prompt" block into your agent, give it the workflow path + reference image + target score, and let it loop.

The agent calibrates by reasoning over the PromptBuilder axes and editing a structured axis state, then rendering that state to a prompt string that it injects into the CalibratorPromptReceptor. This keeps the reasoning axis-aware while staying compatible with the flat-string receptor. (If you later switch the receptor to carry a structured config, the same axis state maps straight onto PromptBuilder's split control nodes.)


Axis state (the agent's working memory)

{
  "cast":        "1 woman, mid-20s, athletic",
  "clothing":    "red lace lingerie",
  "pose":        "standing, hand on hip",
  "scene":       "dimly lit bedroom",
  "composition": "full-body shot, slight low angle",
  "expression":  "soft smile, eye contact",
  "color_light": "warm rim light, shallow depth of field",
  "quality":     "photorealistic, high detail",
  "negative":    "blurry, deformed, lowres, extra limbs",
  "seed":        12345
}

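The first seven keys (cast through color_light) are exactly the Judge's scoring axes. quality, negative, and seed are carried in the state but not scored. Only the first eight keys are rendered into the prompt; negative and seed are passed to the bridge separately. Render order (subject → wardrobe → action → setting → framing → affect → light → quality):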

prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
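
The render step can be sketched in Python as follows. This is a minimal sketch, not the project's actual code: the comma-plus-space separator and the whitespace stripping are assumptions, and only the key order comes from the render line above.

```python
# Render order taken from the document; negative and seed are not rendered.
RENDER_ORDER = ["cast", "clothing", "pose", "scene",
                "composition", "expression", "color_light", "quality"]


def join_nonempty(parts, sep=", "):
    """Join the parts that are non-empty after stripping whitespace.

    The ", " separator is an assumption; the source does not specify one.
    """
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return sep.join(cleaned)


def render(state):
    """Render an axis state (dict) to a flat prompt string."""
    return join_nonempty(state.get(axis, "") for axis in RENDER_ORDER)
```

Skipping empty axes means an axis can be cleared during calibration without leaving a dangling separator in the prompt.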

Per-iteration algorithm (greedy per-axis hill-climb)

state = initial_state
best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
loop:
  i += 1
  prompt = render(state)
  report = run agent_bridge.py --prompt prompt --negative state.negative
                               --seed state.seed --run-tag iter{i}
                               --workflow wf.json --analysis-dir <report_dir>
  score  = report.overall_score
  if score >= TARGET:            # e.g. 0.85
      stop("converged", state, score)
  if score > best_score:
      best_score = score ; best_state = state ; stale = 0
  else:
      stale += 1
      state = best_state         # revert: undo the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS:    # e.g. PATIENCE=4, MAX_ITERS=25
      stop("plateau/budget", best_state, best_score)

  # choose the next single edit:
  worst_axis = axis with lowest per-axis score in report.axes
  edit = map_fix_to_axis(report.fix_suggestions, worst_axis)  # apply the model's suggestion
  state = apply(best_state, worst_axis, edit)                  # change ONE axis only
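
For readers who want something executable, the loop above could look roughly like this in Python. It is a sketch under stated assumptions: `run`, `render`, and `propose_edit` are hypothetical caller-supplied hooks, the report is assumed to be a dict with `overall_score` and an `axes` mapping of axis name to score, and `bridge_argv` only assembles the flags already shown in the pseudocode.

```python
import copy

SCORED_AXES = ["cast", "clothing", "pose", "scene",
               "composition", "expression", "color_light"]


def bridge_argv(prompt, state, i, workflow, report_dir):
    """Build the agent_bridge.py command line from the flags shown above."""
    return ["python", "agent_bridge.py",
            "--workflow", workflow,
            "--prompt", prompt,
            "--negative", state["negative"],
            "--seed", str(state["seed"]),
            "--run-tag", f"iter{i}",
            "--analysis-dir", report_dir]


def hill_climb(initial_state, run, render, propose_edit,
               target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb.

    run(prompt, state, i)      -> report dict (assumed shape, see lead-in)
    render(state)              -> prompt string
    propose_edit(report, axis) -> new value for that axis
    """
    state = copy.deepcopy(initial_state)
    best_state, best_score = copy.deepcopy(state), -1.0
    stale, i, log = 0, 0, []
    while True:
        i += 1
        report = run(render(state), state, i)
        score = report["overall_score"]
        improved = score > best_score
        log.append({"iter": i, "score": score, "kept": improved})
        if improved:
            best_state, best_score, stale = copy.deepcopy(state), score, 0
        else:
            stale += 1  # the next edit restarts from best_state (the revert)
        if best_score >= target:
            return "converged", best_state, best_score, log
        if stale >= patience or i >= max_iters:
            return "plateau/budget", best_state, best_score, log
        # Choose the next single edit: lowest-scoring axis, one change only.
        worst = min(SCORED_AXES, key=lambda a: report["axes"][a])
        state = copy.deepcopy(best_state)
        state[worst] = propose_edit(report, worst)
```

A real `run` hook would presumably pass `bridge_argv(...)` to `subprocess.run` and parse the printed JSON report; keeping it injectable lets the loop be exercised with recorded scores before spending GPU time.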

Rules that matter

  1. Change one axis per iteration. One edit = clean attribution of the score delta. Only batch two edits when two axes score very low and are clearly independent.
  2. Freeze seed while searching axes. The score must reflect the prompt, not sampler noise. Vary the seed only after you've converged, to confirm robustness.
  3. Always edit from best_state, not the last (possibly worse) state — that's the "revert on no improvement" step. Prevents drifting down a bad path.
  4. Target the lowest-scoring axis first, applying the Judge's matching fix_suggestion. If a suggestion doesn't help after a try, pick an alternative value for that axis before moving on.
  5. Near the margin, don't over-trust one reading. swap_eval already averages two orderings; if two candidates are within ~0.03, re-run each on a second seed and compare averages before committing.
  6. Detect gaming/oscillation. If scores bounce without net gain, reduce edit size (smaller, more specific wording changes) and re-anchor on best_state.
  7. Log every step: (iter, axis_changed, old→new value, prompt, overall_score, per-axis scores). The run must be auditable and resumable.
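
Rule 5 can be made concrete with a small helper. The sketch below is illustrative only: the function names are hypothetical, the 0.03 margin is the approximate figure from the rule, and breaking exact ties in favour of the incumbent is an assumption rather than something the policy states.

```python
from statistics import mean


def needs_second_seed(score_a, score_b, margin=0.03):
    """True when two candidates are too close to separate on one reading."""
    return abs(score_a - score_b) <= margin


def pick_candidate(scores_a, scores_b):
    """Compare mean scores across seeds and return "a" or "b".

    "a" is treated as the incumbent (current best_state) and wins exact
    ties; that tie-break is an assumption, not part of the policy.
    """
    return "b" if mean(scores_b) > mean(scores_a) else "a"
```

For example, if the incumbent scored 0.71 and the new edit 0.69 on the frozen seed, `needs_second_seed(0.71, 0.69)` is true, so each would be re-run on a second seed and the per-candidate score lists passed to `pick_candidate`.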

Mapping fix_suggestions → axes

The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous, attribute it to the lowest-scoring axis it plausibly affects.
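
One way to sketch this keyword matching in Python is shown below. The keyword table is entirely hypothetical (the source does not list the Judge's vocabulary) and would need tuning against real fix_suggestions; the fallback for a fix that matches no keyword is also an assumption.

```python
import re

# Hypothetical keyword table; extend it from real Judge output.
AXIS_KEYWORDS = {
    "cast":        {"cast", "woman", "man", "person"},
    "clothing":    {"clothing", "outfit", "lace", "dress"},
    "pose":        {"pose", "standing", "seated", "sitting"},
    "scene":       {"scene", "bedroom", "kitchen", "background"},
    "composition": {"composition", "framing", "angle", "shot"},
    "expression":  {"expression", "smile", "gaze"},
    "color_light": {"color_light", "lighting", "light", "warm", "warmer"},
}


def map_fix_to_axis(fix, axis_scores):
    """Return the axis a fix suggestion most plausibly targets.

    fix:         one suggestion string from the Judge
    axis_scores: dict of axis name -> per-axis score
    If several axes match, the lowest-scoring one wins. If none match,
    every scored axis is considered (an assumed fallback).
    """
    tokens = set(re.findall(r"[a-z_]+", fix.lower()))
    hits = [axis for axis, words in AXIS_KEYWORDS.items() if tokens & words]
    if not hits:
        hits = list(axis_scores)
    return min(hits, key=lambda axis: axis_scores[axis])
```

Matching on whole tokens rather than substrings avoids false hits such as "man" inside "woman".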


Worked example

iter1  prompt="1 woman, casual outfit, indoors, ..."          score=0.41
       axes: scene 0.30 (worst) — "ref bedroom, gen kitchen"
       fix:  "set scene to a dim bedroom"
iter2  edit scene→"dimly lit bedroom"                          score=0.58  (kept)
       axes: pose 0.35 (worst) — "ref standing, gen seated"
iter3  edit pose→"standing, hand on hip"                       score=0.71  (kept)
       axes: color_light 0.50 (worst) — "ref warm, gen flat"
iter4  edit color_light→"warm rim light"                       score=0.69  (worse → revert)
iter5  edit color_light→"warm golden hour glow"               score=0.83  (kept)
       axes: clothing 0.78 (worst) — "gen lacks lace detail"
iter6  edit clothing→"red lace lingerie with trim"            score=0.88  ≥ target → STOP

Agent system prompt (paste into your CLI agent)

You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference image, measured by a Qwen3-VL judge that scores 7 axes (cast, clothing, pose, scene, composition, expression, color_light) from 0 to 1.

You hold an axis state (JSON, keys above). Each turn you: (1) render the state to a prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→quality; (2) run python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir <report_dir>; (3) read the printed JSON report.

Then apply greedy per-axis hill-climb: keep the change only if overall_score improved, else revert to the best state; pick the lowest-scoring axis and apply the Judge's matching fix_suggestion as a single edit. Keep the seed fixed while searching. Stop when overall_score ≥ TARGET (default 0.85), or after PATIENCE=4 non-improving iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.

Never change more than one axis at a time unless two axes are both very low and clearly independent. Never trust a single near-margin reading: re-run on a second seed when two candidates are within 0.03.