Initial commit: VLM-as-judge prompt calibration loop

Adds the Qwen3-VL image-similarity judge node, the external-prompt receptor node, the agent_bridge CLI, an example SDXL workflow, and the methodology, agent-loop, and calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:15:56 +02:00
commit 95198a15b5
13 changed files with 1294 additions and 0 deletions
@@ -0,0 +1,135 @@
+# Calibration policy — the agent's playbook
+
+This is the instruction set the **external CLI agent** (the controller) follows each
+iteration. Paste the "Agent system prompt" block into your agent, give it the workflow
+path + reference image + target score, and let it loop.
+
+The agent calibrates by reasoning over the **Prompt‑Builder axes** and editing a
+structured *axis state*, then **rendering that state to a prompt string** that it injects
+into the `CalibratorPromptReceptor`. This keeps the reasoning axis‑aware while staying
+compatible with the flat‑string receptor. (If you later switch the receptor to carry a
+structured config, the same axis state maps straight onto Prompt‑Builder's split control
+nodes.)
+
+---
+
+## Axis state (the agent's working memory)
+
+```json
+{
+  "cast":        "1 woman, mid-20s, athletic",
+  "clothing":    "red lace lingerie",
+  "pose":        "standing, hand on hip",
+  "scene":       "dimly lit bedroom",
+  "composition": "full-body shot, slight low angle",
+  "expression":  "soft smile, eye contact",
+  "color_light": "warm rim light, shallow depth of field",
+  "quality":     "photorealistic, high detail",
+  "negative":    "blurry, deformed, lowres, extra limbs",
+  "seed":        12345
+}
+```
+
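+The first seven keys are exactly the Judge's scoring axes; `quality`, `negative`, and
+`seed` are carried but not scored. `negative` and `seed` reach the bridge as separate
+flags rather than being rendered into the prompt. Render order (subject → wardrobe →
+action → setting → framing → affect → light → quality):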
+
+```
+prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
+```
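
The render step above can be sketched in Python as follows. This is a minimal sketch, not the repository's implementation: the `", "` separator and the choice to strip whitespace are assumptions, and `RENDER_ORDER` is a name introduced here for illustration.

```python
# Axis keys in the documented render order; "negative" and "seed" are
# deliberately absent because they are passed to the bridge separately.
RENDER_ORDER = ["cast", "clothing", "pose", "scene", "composition",
                "expression", "color_light", "quality"]


def join_nonempty(parts, sep=", "):
    """Join the parts that are non-empty after stripping whitespace.

    The separator is an assumption; the policy only says "join".
    """
    return sep.join(p.strip() for p in parts if p and p.strip())


def render(state):
    """Render an axis state (dict of strings) to a flat prompt string."""
    return join_nonempty(state.get(axis, "") for axis in RENDER_ORDER)
```

Missing or empty axes are skipped, so a partially filled state still renders to a clean prompt without stray separators.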
+
+---
+
+## Per‑iteration algorithm (greedy per‑axis hill‑climb)
+
+```
+state = initial_state
+best_score = -1 ; best_state = state ; best_report = none ; stale = 0 ; i = 0
+loop:
+  i += 1
+  prompt = render(state)
+  report = run agent_bridge.py --prompt prompt --negative state.negative
+                               --seed state.seed --run-tag iter{i}
+                               --workflow wf.json --analysis-dir <report_dir>
+  score  = report.overall_score
+  if score >= TARGET:            # e.g. 0.85
+      stop("converged", state, score)
+  if score > best_score:
+      best_score = score ; best_state = state ; best_report = report ; stale = 0
+  else:
+      stale += 1
+      state = best_state         # revert: undo the change that didn't help
+  if stale >= PATIENCE or i >= MAX_ITERS:    # e.g. PATIENCE=4, MAX_ITERS=25
+      stop("plateau/budget", best_state, best_score)
+
+  # choose the next single edit from the best state's report; after a revert the
+  # latest report describes the discarded image, not best_state:
+  worst_axis = axis with lowest per-axis score in best_report.axes
+  edit = map_fix_to_axis(best_report.fix_suggestions, worst_axis)  # apply the model's suggestion
+  state = apply(best_state, worst_axis, edit)                      # change ONE axis only
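
For readers who prefer executable code, the loop above might look like this in Python. It is a sketch under stated assumptions: `evaluate` and `propose_edit` are hypothetical callables standing in for the bridge call and the fix-mapping step, the report is assumed to be a dict with `overall_score` and a per-axis `axes` mapping, and the next edit is always derived from the report of the best state so that a reverted iteration does not steer the search.

```python
SCORED_AXES = ["cast", "clothing", "pose", "scene",
               "composition", "expression", "color_light"]


def calibrate(initial_state, evaluate, propose_edit,
              target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb over an axis state.

    evaluate(state, i)               -> report dict (hypothetical bridge wrapper)
    propose_edit(report, axis, stale) -> new string value for that axis
    Returns (status, state, score).
    """
    state = dict(initial_state)
    best_state, best_report, best_score = state, None, -1.0
    stale = 0
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)
        score = report["overall_score"]
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_state, best_report, best_score = state, report, score
            stale = 0
        else:
            stale += 1  # the edit did not help; it is discarded below
        if stale >= patience:
            return "plateau", best_state, best_score
        # Always branch from the best state and change exactly one axis.
        axes = best_report["axes"]
        worst_axis = min(SCORED_AXES, key=lambda a: axes[a])
        state = dict(best_state)
        state[worst_axis] = propose_edit(best_report, worst_axis, stale)
    return "budget", best_state, best_score
```

Passing `stale` to `propose_edit` lets the caller pick an alternative value for the same axis after a failed try instead of repeating the same suggestion.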
+
+### Rules that matter
+
+1. **Change one axis per iteration.** One edit = clean attribution of the score delta.
+   Only batch two edits when two axes score very low *and* are clearly independent.
+2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not
+   sampler noise. Vary the seed only after you've converged, to confirm robustness.
+3. **Always edit from `best_state`, not the last (possibly worse) state** — that's the
+   "revert on no improvement" step. Prevents drifting down a bad path.
+4. **Target the lowest‑scoring axis first**, applying the Judge's matching
+   `fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value
+   for that axis before moving on.
+5. **Near the margin, don't over‑trust one reading.** `swap_eval` already averages two
+   orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare
+   averages before committing.
+6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size
+   (smaller, more specific wording changes) and re‑anchor on `best_state`.
+7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per‑axis)`.
+   The run must be auditable and resumable.
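
One possible way to satisfy rule 7 is an append-only JSON Lines log, sketched below. The file layout and the helper names `log_step` and `load_log` are assumptions made for illustration; the record fields follow the tuple listed in the rule, and the report is assumed to expose `overall_score` and `axes`.

```python
import json


def log_step(path, iteration, axis_changed, old_value, new_value, prompt, report):
    """Append one iteration record as a single JSON line and return it."""
    record = {
        "iter": iteration,
        "axis_changed": axis_changed,
        "old_value": old_value,
        "new_value": new_value,
        "prompt": prompt,
        "overall_score": report["overall_score"],
        "axes": report["axes"],
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_log(path):
    """Read all records back, e.g. to resume from the best-scoring step."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each step is one self-contained line, an interrupted run leaves a readable log, and resuming amounts to reloading the records and picking the one with the highest `overall_score`.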
+
+### Mapping `fix_suggestions` → axes
+
+The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to
+clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous,
+attribute it to the lowest‑scoring axis it plausibly affects.
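
The keyword matching described above could be sketched like this. The keyword table is entirely hypothetical and would need tuning against the Judge's real wording; `axes_for_fix` and `attribute_fix` are illustrative names, not functions from the repository.

```python
import re

# Hypothetical keyword table: axis key -> words that hint at that axis.
AXIS_KEYWORDS = {
    "cast": ["cast", "subject", "person"],
    "clothing": ["clothing", "outfit", "lace", "wardrobe"],
    "pose": ["pose", "standing", "seated", "sitting"],
    "scene": ["scene", "bedroom", "kitchen", "background"],
    "composition": ["composition", "framing", "angle", "shot"],
    "expression": ["expression", "smile", "gaze"],
    "color_light": ["color_light", "lighting", "light", "warm", "warmer"],
}


def axes_for_fix(fix):
    """Return every axis whose keywords appear as whole words in the fix text."""
    text = fix.lower()
    return [axis for axis, words in AXIS_KEYWORDS.items()
            if any(re.search(r"\b" + re.escape(w) + r"\b", text) for w in words)]


def attribute_fix(fix, axis_scores):
    """Attribute a fix to one axis; ambiguous fixes go to the lowest-scoring match.

    Returns None when no keyword matches, so the caller can handle it explicitly.
    """
    candidates = axes_for_fix(fix)
    if not candidates:
        return None
    return min(candidates, key=lambda a: axis_scores[a])
```

Whole-word matching avoids accidental hits inside longer words, at the cost of having to list inflected forms such as "warmer" explicitly.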
+
+---
+
+## Worked example
+
+```
+iter1  prompt="1 woman, casual outfit, indoors, ..."          score=0.41
+       axes: scene 0.30 (worst) — "ref bedroom, gen kitchen"
+       fix:  "set scene to a dim bedroom"
+iter2  edit scene→"dimly lit bedroom"                          score=0.58  (kept)
+       axes: pose 0.35 (worst) — "ref standing, gen seated"
+iter3  edit pose→"standing, hand on hip"                       score=0.71  (kept)
+       axes: color_light 0.50 (worst) — "ref warm, gen flat"
+iter4  edit color_light→"warm rim light"                       score=0.69  (worse → revert)
+iter5  edit color_light→"warm golden hour glow"               score=0.83  (kept)
+       axes: clothing 0.78 (worst) — "gen lacks lace detail"
+iter6  edit clothing→"red lace lingerie with trim"            score=0.88  ≥ target → STOP
+```
+
+---
+
+## Agent system prompt (paste into your CLI agent)
+
+> You are the controller for a local image prompt calibrator. Goal: make a generated
+> image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes
+> (cast, clothing, pose, scene, composition, expression, color_light), each from 0 to 1.
+>
+> You hold an **axis state**: a JSON object with those 7 keys plus `quality`,
+> `negative`, and `seed`. Each turn you: (1) render the state to a
+> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→
+> quality; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
+> --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir
+> <report_dir>`; (3) read the printed JSON report.
+>
+> Then apply greedy per‑axis hill‑climb: keep the change only if `overall_score` improved,
+> else revert to the best state; pick the **lowest‑scoring axis** and apply the Judge's
+> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching.
+> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non‑improving
+> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
+>
+> Never change more than one axis at a time unless two axes are both very low and clearly
+> independent. Never trust a single near‑margin reading — re‑run on a second seed when two
+> candidates are within 0.03.
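
If the controller is a script rather than a chat agent, step (2) of the prompt above could be wrapped as shown here. Only the flag names come from this document; the assumption that the bridge prints exactly one JSON document on stdout, and the helper names `build_command` and `run_iteration`, are illustrative.

```python
import json
import subprocess
import sys


def build_command(workflow, state, prompt, iteration, report_dir):
    """Assemble the agent_bridge.py argv from the flags listed in this document."""
    return [
        sys.executable, "agent_bridge.py",
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", state["negative"],
        "--seed", str(state["seed"]),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", report_dir,
    ]


def run_iteration(workflow, state, prompt, iteration, report_dir):
    """Run one bridge call and parse the printed report.

    Assumption: stdout contains a single JSON document and nothing else.
    Raises CalledProcessError if the bridge exits with a non-zero status.
    """
    cmd = build_command(workflow, state, prompt, iteration, report_dir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems when a rendered prompt contains commas, quotes, or parentheses.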