Ethanfel 95198a15b5 Initial commit: VLM-as-judge prompt calibration loop

Qwen3-VL image-similarity judge node, external-prompt receptor node,
agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/
calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-26 22:15:56 +02:00


Calibration policy — the agent's playbook

This is the instruction set the external CLI agent (the controller) follows each iteration. Paste the "Agent system prompt" block into your agent, give it the workflow path + reference image + target score, and let it loop.

The agent calibrates by reasoning over the Prompt‑Builder axes and editing a structured axis state, then rendering that state to a prompt string that it injects into the CalibratorPromptReceptor. This keeps the reasoning axis‑aware while staying compatible with the flat‑string receptor. (If you later switch the receptor to carry a structured config, the same axis state maps straight onto Prompt‑Builder's split control nodes.)

Axis state (the agent's working memory)

{
  "cast":        "1 woman, mid-20s, athletic",
  "clothing":    "red lace lingerie",
  "pose":        "standing, hand on hip",
  "scene":       "dimly lit bedroom",
  "composition": "full-body shot, slight low angle",
  "expression":  "soft smile, eye contact",
  "color_light": "warm rim light, shallow depth of field",
  "quality":     "photorealistic, high detail",
  "negative":    "blurry, deformed, lowres, extra limbs",
  "seed":        12345
}

The first seven keys (cast through color_light) are exactly the Judge's scoring axes; quality, negative, and seed are carried but not scored. Render order (subject → wardrobe → action → setting → framing → affect → light → quality):

prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
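A minimal Python sketch of the rendering step, assuming the axis values are joined with a comma and a space (the separator is not specified above) and that empty or missing axes are simply skipped. The names RENDER_ORDER and render are illustrative.

```python
# Render order from the policy: subject, wardrobe, action, setting,
# framing, affect, light, quality. negative and seed are passed separately.
RENDER_ORDER = ["cast", "clothing", "pose", "scene",
                "composition", "expression", "color_light", "quality"]


def join_nonempty(parts, sep=", "):
    """Join the parts that contain text, skipping empty or missing values.

    The ", " separator is an assumption, not something the policy fixes.
    """
    return sep.join(p.strip() for p in parts if p and p.strip())


def render(state):
    """Flatten an axis-state dict into a single prompt string."""
    return join_nonempty(state.get(key, "") for key in RENDER_ORDER)


# seed is carried in the state but never rendered into the prompt.
print(render({"cast": "1 woman", "pose": "", "scene": "dimly lit bedroom", "seed": 12345}))
# 1 woman, dimly lit bedroom
```

Keeping the order in one list means a later change to the render order touches a single place.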

Per‑iteration algorithm (greedy per‑axis hill‑climb)

best_score = -1 ; best_state = initial_state ; state = initial_state ; stale = 0 ; i = 0
loop:
  i += 1
  prompt = render(state)
  report = run agent_bridge.py --prompt prompt --negative state.negative
                               --seed state.seed --run-tag iter{i}
                               --workflow wf.json --analysis-dir <report_dir>
  score  = report.overall_score
  if score >= TARGET:            # e.g. 0.85
      stop("converged", state, score)
  if score > best_score:
      best_score = score ; best_state = state ; best_report = report ; stale = 0
  else:
      stale += 1
      state = best_state         # revert: undo the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS:    # e.g. PATIENCE=4, MAX_ITERS=25
      stop("plateau/budget", best_state, best_score)

  # choose the next single edit, reading the report that belongs to best_state:
  worst_axis = axis with lowest per-axis score in best_report.axes
  edit = map_fix_to_axis(best_report.fix_suggestions, worst_axis)  # after a revert, try an alternative value
  state = apply(best_state, worst_axis, edit)                       # change ONE axis only
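The loop above can be sketched in Python as follows. This is one possible reading, not the project's implementation: evaluate and propose_edit are hypothetical callables standing in for the agent_bridge.py call and the fix mapping, the report is assumed to be a dict with "overall_score" and an "axes" dict of axis name to score, and after a revert the sketch reads the per-axis scores from the report of the best state.

```python
AXES = ["cast", "clothing", "pose", "scene",
        "composition", "expression", "color_light"]


def calibrate(initial_state, evaluate, propose_edit,
              target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb.

    evaluate(state, i) -> report dict (assumed shape, see lead-in)
    propose_edit(report, axis, attempt) -> new value for that axis;
    attempt counts the non-improving tries so far, so the callable
    can offer an alternative value after a revert.
    """
    state = dict(initial_state)
    best_state, best_report, best_score = state, None, -1.0
    stale = 0
    history = []  # (iteration, overall_score) pairs for auditing
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)
        score = report["overall_score"]
        history.append((i, score))
        if score >= target:
            return "converged", state, score, history
        if score > best_score:
            best_state, best_report, best_score = state, report, score
            stale = 0
        else:
            stale += 1  # no gain: the next edit starts from best_state again
        if stale >= patience:
            break
        axis_scores = best_report["axes"]
        worst = min(AXES, key=lambda a: axis_scores[a])
        state = dict(best_state)  # copy, so best_state is never mutated
        state[worst] = propose_edit(best_report, worst, stale)  # ONE axis only
    return "plateau/budget", best_state, best_score, history
```

Passing the evaluator in as a callable keeps the search logic testable without a GPU: a fake evaluator that returns canned scores is enough to exercise the keep, revert, and stop branches.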

Rules that matter

Change one axis per iteration. One edit = clean attribution of the score delta. Only batch two edits when two axes score very low and are clearly independent.
Freeze seed while searching axes. The score must reflect the prompt, not sampler noise. Vary the seed only after you've converged, to confirm robustness.
Always edit from best_state, not the last (possibly worse) state. That is the "revert on no improvement" step, and it prevents drifting down a bad path.
Target the lowest‑scoring axis first, applying the Judge's matching fix_suggestion. If a suggestion doesn't help after a try, pick an alternative value for that axis before moving on.
Near the margin, don't over‑trust one reading. swap_eval already averages two orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare averages before committing.
Detect gaming/oscillation. If scores bounce without net gain, reduce edit size (smaller, more specific wording changes) and re‑anchor on best_state.
Log every step: (iter, axis_changed, old→new value, prompt, overall_score, per‑axis). The run must be auditable and resumable.
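One way to satisfy the logging rule is an append-only JSON Lines file, sketched below. The file layout, the field names, and the helper names (log_step, load_log, resume_point) are assumptions for illustration; the policy only fixes which values must be recorded.

```python
import json
from pathlib import Path


def log_step(path, iteration, axis_changed, old, new, prompt, overall_score, axes):
    """Append one iteration as a JSON line; appending keeps earlier steps intact."""
    record = {
        "iter": iteration,
        "axis_changed": axis_changed,
        "old": old,
        "new": new,
        "prompt": prompt,
        "overall_score": overall_score,
        "axes": axes,
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_log(path):
    """Read all logged steps back; a missing file means a fresh run."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def resume_point(records):
    """Return (next iteration number, best record so far) for resuming a run."""
    if not records:
        return 1, None
    best = max(records, key=lambda r: r["overall_score"])
    return records[-1]["iter"] + 1, best
```

On restart, resume_point(load_log(path)) gives the iteration to continue from and the record whose prompt scored highest, which is enough to rebuild best_state if the axis values are logged alongside.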

Mapping `fix_suggestions` → axes

The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous, attribute it to the lowest‑scoring axis it plausibly affects.
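A rough sketch of the keyword matching described above. The keyword sets are illustrative guesses, not the Judge's actual vocabulary, and axis_for_fix is a hypothetical helper name. When several axes match, or none do, the sketch falls back to the lowest-scoring candidate, following the ambiguity rule.

```python
import re

# Illustrative keyword sets per axis (assumption: tune these to the Judge's wording).
AXIS_KEYWORDS = {
    "cast": {"cast", "subject", "woman", "person"},
    "clothing": {"clothing", "outfit", "lace", "trim"},
    "pose": {"pose", "standing", "seated"},
    "scene": {"scene", "bedroom", "kitchen", "background"},
    "composition": {"composition", "framing", "angle", "shot"},
    "expression": {"expression", "smile", "gaze"},
    "color_light": {"color_light", "lighting", "light", "warm", "warmer"},
}


def axis_for_fix(fix, axis_scores):
    """Return the axis a fix suggestion most plausibly targets.

    Whole-word matching avoids false hits such as "lace" inside "replace".
    Among matching axes the lowest-scoring one wins; with no match at all,
    the lowest-scoring axis overall is used.
    """
    tokens = set(re.findall(r"[a-z_]+", fix.lower()))
    hits = [axis for axis, words in AXIS_KEYWORDS.items() if tokens & words]
    candidates = hits or list(axis_scores)
    return min(candidates, key=lambda axis: axis_scores[axis])


scores = {"cast": 0.90, "clothing": 0.78, "pose": 0.35, "scene": 0.30,
          "composition": 0.80, "expression": 0.70, "color_light": 0.50}
print(axis_for_fix("set pose=standing", scores))          # pose
print(axis_for_fix("add lace trim to clothing", scores))  # clothing
print(axis_for_fix("warmer lighting", scores))            # color_light
```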

Worked example

iter1  prompt="1 woman, casual outfit, indoors, ..."          score=0.41
       axes: scene 0.30 (worst) — "ref bedroom, gen kitchen"
       fix:  "set scene to a dim bedroom"
iter2  edit scene→"dimly lit bedroom"                          score=0.58  (kept)
       axes: pose 0.35 (worst) — "ref standing, gen seated"
iter3  edit pose→"standing, hand on hip"                       score=0.71  (kept)
       axes: color_light 0.50 (worst) — "ref warm, gen flat"
iter4  edit color_light→"warm rim light"                       score=0.69  (worse → revert)
iter5  edit color_light→"warm golden hour glow"               score=0.83  (kept)
       axes: clothing 0.78 (worst) — "gen lacks lace detail"
iter6  edit clothing→"red lace lingerie with trim"            score=0.88  ≥ target → STOP
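The keep/revert/stop decisions in this trace follow mechanically from the score sequence. As a small check, the sketch below replays the six scores through the same comparison the algorithm uses; label_trace is a hypothetical helper, and the first iteration is labeled "kept" because it becomes the initial best.

```python
def label_trace(scores, target=0.85):
    """Label each score as kept, revert, or stop using the hill-climb rule."""
    best = -1.0
    labels = []
    for score in scores:
        if score >= target:
            labels.append("stop")  # target reached, the loop ends here
            break
        if score > best:
            best = score
            labels.append("kept")
        else:
            labels.append("revert")  # best is unchanged, edit is undone
    return labels


print(label_trace([0.41, 0.58, 0.71, 0.69, 0.83, 0.88]))
# ['kept', 'kept', 'kept', 'revert', 'kept', 'stop']
```

Note that iter5 is compared against 0.71, the best score before the reverted iter4, not against 0.69.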

Agent system prompt (paste into your CLI agent)

You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes (cast, clothing, pose, scene, composition, expression, color_light) from 0–1.

You hold an axis state (JSON with keys cast, clothing, pose, scene, composition, expression, color_light, quality, negative, seed). Each turn you: (1) render the state to a prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→quality; (2) run python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir <report_dir>; (3) read the printed JSON report.

Then apply greedy per‑axis hill‑climb: keep the change only if overall_score improved, else revert to the best state; pick the lowest‑scoring axis and apply the Judge's matching fix_suggestion as a single edit. Keep the seed fixed while searching. Stop when overall_score ≥ TARGET (default 0.85), or after PATIENCE=4 non‑improving iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.

Never change more than one axis at a time unless two axes are both very low and clearly independent. Never trust a single near‑margin reading — re‑run on a second seed when two candidates are within 0.03.
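Not part of the pasted prompt: if you script step (2) yourself instead of letting the agent type the command, the call could look like the sketch below. The flag names are the ones listed in the prompt; build_command and run_bridge are hypothetical helpers, and the sketch assumes agent_bridge.py sits in the working directory and prints only the JSON report to stdout.

```python
import json
import subprocess
import sys


def build_command(workflow, prompt, negative, seed, iteration, report_dir):
    """Assemble the agent_bridge.py call as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "agent_bridge.py",
        "--workflow", str(workflow),
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", str(report_dir),
    ]


def run_bridge(cmd):
    """Run the bridge and parse its output.

    Assumption: stdout contains the JSON report and nothing else.
    check=True raises CalledProcessError if the bridge exits non-zero.
    """
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)
```

Passing the arguments as a list rather than a shell string means prompts containing quotes or commas reach the bridge unchanged.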
