# Calibration policy: the agent's playbook

This is the instruction set the **external CLI agent** (the controller) follows each iteration. Paste the "Agent system prompt" block into your agent, give it the workflow path + reference image + target score, and let it loop.

The agent calibrates by reasoning over the **Prompt-Builder axes** and editing a structured *axis state*, then **rendering that state to a prompt string** that it injects into the `CalibratorPromptReceptor`. This keeps the reasoning axis-aware while staying compatible with the flat-string receptor. (If you later switch the receptor to carry a structured config, the same axis state maps straight onto Prompt-Builder's split control nodes.)

---

## Axis state (the agent's working memory)

```json
{
  "cast": "1 woman, mid-20s, athletic",
  "clothing": "red lace lingerie",
  "pose": "standing, hand on hip",
  "scene": "dimly lit bedroom",
  "composition": "full-body shot, slight low angle",
  "expression": "soft smile, eye contact",
  "color_light": "warm rim light, shallow depth of field",
  "quality": "photorealistic, high detail",
  "negative": "blurry, deformed, lowres, extra limbs",
  "seed": 12345
}
```

The first seven keys (`cast` through `color_light`) are exactly the Judge's scoring axes. `quality`, `negative`, and `seed` are carried in the state but not scored.

Render order (subject → wardrobe → action → setting → framing → affect → light → quality):

```
prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
```

---

## Per-iteration algorithm (greedy per-axis hill-climb)

```
state = initial_state
best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)
    report = run agent_bridge.py --prompt prompt --negative state.negative --seed state.seed
                                 --run-tag iter{i} --workflow wf.json --analysis-dir
    score  = report.overall_score

    if score >= TARGET:                      # e.g. 0.85
        stop("converged", state, score)

    if score > best_score:
        best_score = score ; best_state = state ; stale = 0
    else:
        stale += 1
        state = best_state                   # revert: undo the change that didn't help

    if stale >= PATIENCE or i >= MAX_ITERS:  # e.g. PATIENCE=4, MAX_ITERS=25
        stop("plateau/budget", best_state, best_score)

    # choose the next single edit:
    worst_axis = axis with lowest per-axis score in report.axes
    edit       = map_fix_to_axis(report.fix_suggestions, worst_axis)   # apply the Judge's suggestion
    state      = apply(best_state, worst_axis, edit)                   # change ONE axis only
```

### Rules that matter

1. **Change one axis per iteration.** One edit = clean attribution of the score delta. Only batch two edits when two axes score very low *and* are clearly independent.
2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not sampler noise. Vary the seed only after you've converged, to confirm robustness.
3. **Always edit from `best_state`, not the last (possibly worse) state.** That's the "revert on no improvement" step; it prevents drifting down a bad path.
4. **Target the lowest-scoring axis first**, applying the Judge's matching `fix_suggestion`. If a suggestion doesn't help after one try, pick an alternative value for that axis before moving on.
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two orderings; if two candidates are within ~0.03, re-run each on a second seed and compare averages before committing.
6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size (smaller, more specific wording changes) and re-anchor on `best_state`.
7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per-axis)`. The run must be auditable and resumable.

### Mapping `fix_suggestions` → axes

The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to clothing", "warmer lighting"). Match each fix to an axis key by keyword; if a fix is ambiguous, attribute it to the lowest-scoring axis it plausibly affects.

---

## Worked example

```
iter1  prompt="1 woman, casual outfit, indoors, ..."          score=0.41
       axes: scene 0.30 (worst) - "ref bedroom, gen kitchen"
       fix:  "set scene to a dim bedroom"
iter2  edit scene→"dimly lit bedroom"                          score=0.58 (kept)
       axes: pose 0.35 (worst) - "ref standing, gen seated"
iter3  edit pose→"standing, hand on hip"                       score=0.71 (kept)
       axes: color_light 0.50 (worst) - "ref warm, gen flat"
iter4  edit color_light→"warm rim light"                       score=0.69 (worse → revert)
iter5  edit color_light→"warm golden hour glow"                score=0.83 (kept)
       axes: clothing 0.78 (worst) - "gen lacks lace detail"
iter6  edit clothing→"red lace lingerie with trim"             score=0.88 ≥ target → STOP
```

---

## Agent system prompt (paste into your CLI agent)

> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference image, measured by a Qwen3-VL judge that scores 7 axes
> (cast, clothing, pose, scene, composition, expression, color_light) from 0–1.
>
> You hold an **axis state** (JSON, keys above). Each turn you: (1) render the state to a
> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→
> quality; (2) run `python agent_bridge.py --workflow --prompt ""
> --negative "" --seed --run-tag iter --analysis-dir
> `; (3) read the printed JSON report.
>
> Then apply greedy per-axis hill-climb: keep the change only if `overall_score` improved,
> else revert to the best state; pick the **lowest-scoring axis** and apply the Judge's
> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching.
> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non-improving
> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
>
> Never change more than one axis at a time unless two axes are both very low and clearly
> independent. Never trust a single near-margin reading: re-run on a second seed when two
> candidates are within 0.03.
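
---

For readers who want to dry-run the policy before wiring it to a real workflow, the render step and the greedy per-axis hill-climb can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's implementation: the `", "` separator in `render`, the report shape (a dict with `overall_score`, an `axes` dict of per-axis scores, and a `fix_suggestions` dict keyed by axis), and the names `hill_climb`, `toy_judge`, `take_suggestion`, and `REFERENCE` are all hypothetical. The `judge` callable stands in for running `agent_bridge.py` and parsing its printed JSON report.

```python
from typing import Callable, Dict, List, Tuple

# The seven scored axes in render order; "quality" is rendered but not scored.
SCORED_AXES = ["cast", "clothing", "pose", "scene", "composition", "expression", "color_light"]
RENDER_ORDER = SCORED_AXES + ["quality"]


def render(state: Dict) -> str:
    """Join the non-empty axis values in render order (the ", " separator is an assumption)."""
    parts = [str(state.get(key, "")).strip() for key in RENDER_ORDER]
    return ", ".join(p for p in parts if p)


def hill_climb(initial_state: Dict, judge: Callable[[Dict], Dict],
               propose: Callable[[Dict, str], str],
               target: float = 0.85, patience: int = 4,
               max_iters: int = 25) -> Tuple[str, Dict, float, List]:
    """Greedy per-axis hill-climb; `judge` and `propose` are hypothetical callables."""
    state = dict(initial_state)
    best_state, best_score, stale = dict(state), -1.0, 0
    log = []
    for i in range(1, max_iters + 1):
        report = judge(state)                  # stand-in for one agent_bridge.py run
        score = report["overall_score"]
        if score > best_score:
            best_state, best_score, stale = dict(state), score, 0
        else:
            stale += 1                         # change did not help; it is dropped below
        log.append((i, render(state), score))  # keep the run auditable
        if score >= target:
            return "converged", best_state, best_score, log
        if stale >= patience:
            return "plateau", best_state, best_score, log
        # Pick the lowest-scoring axis and make a single edit from the best state.
        worst = min(SCORED_AXES, key=lambda axis: report["axes"][axis])
        state = dict(best_state)               # revert step: always edit from best_state
        state[worst] = propose(report, worst)  # change ONE axis only; seed stays fixed
    return "budget", best_state, best_score, log


# Toy stand-ins so the sketch runs without a workflow or a vision judge.
REFERENCE = {"cast": "1 woman", "clothing": "red dress", "pose": "standing",
             "scene": "dim bedroom", "composition": "full-body shot",
             "expression": "soft smile", "color_light": "warm rim light"}


def toy_judge(state: Dict) -> Dict:
    """Score each axis 1.0 on an exact match with REFERENCE, else 0.0."""
    axes = {a: 1.0 if state.get(a) == REFERENCE[a] else 0.0 for a in SCORED_AXES}
    fixes = {a: REFERENCE[a] for a, s in axes.items() if s < 1.0}
    return {"overall_score": sum(axes.values()) / len(axes),
            "axes": axes, "fix_suggestions": fixes}


def take_suggestion(report: Dict, axis: str) -> str:
    """Apply the judge's suggested value for the chosen axis verbatim."""
    return report["fix_suggestions"][axis]


start = dict(REFERENCE, pose="seated", scene="kitchen", color_light="flat light",
             quality="photorealistic", negative="blurry", seed=12345)
status, best, score, log = hill_climb(start, toy_judge, take_suggestion)
print(status, round(score, 3), len(log))
```

Passing the judge in as a callable keeps the loop testable offline; in a real run it would shell out to `agent_bridge.py` with the frozen seed and return the parsed report. The toy judge never produces a regression, so the revert and plateau branches are exercised only when a real (noisy) judge is plugged in.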