Calibration policy — the agent's playbook
This is the instruction set the external CLI agent (the controller) follows each iteration. Paste the "Agent system prompt" block into your agent, give it the workflow path + reference image + target score, and let it loop.
The agent calibrates by reasoning over the Prompt‑Builder axes and editing a
structured axis state, then rendering that state to a prompt string that it injects
into the CalibratorPromptReceptor. This keeps the reasoning axis‑aware while staying
compatible with the flat‑string receptor. (If you later switch the receptor to carry a
structured config, the same axis state maps straight onto Prompt‑Builder's split control
nodes.)
Axis state (the agent's working memory)
{
  "cast": "1 woman, mid-20s, athletic",
  "clothing": "red lace lingerie",
  "pose": "standing, hand on hip",
  "scene": "dimly lit bedroom",
  "composition": "full-body shot, slight low angle",
  "expression": "soft smile, eye contact",
  "color_light": "warm rim light, shallow depth of field",
  "quality": "photorealistic, high detail",
  "negative": "blurry, deformed, lowres, extra limbs",
  "seed": 12345
}
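The first seven keys (cast through color_light) are exactly the Judge's scoring axes.
quality, negative, and seed are carried in the state but not scored. Only quality is
rendered into the prompt; negative and seed are passed to the bridge separately.
Render order (subject → wardrobe → action → setting → framing → affect → light → quality):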
prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
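The render step above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function name render, the RENDER_ORDER constant, and the ", " separator are assumptions, since the source only names join_nonempty and the key order.

```python
# Keys in the render order given above; negative and seed are not rendered.
RENDER_ORDER = [
    "cast", "clothing", "pose", "scene",
    "composition", "expression", "color_light", "quality",
]


def render(state, sep=", "):
    """Join the non-empty axis values in render order (separator is an assumption)."""
    parts = [str(state.get(key, "")).strip() for key in RENDER_ORDER]
    return sep.join(part for part in parts if part)


example_state = {
    "cast": "1 woman, mid-20s, athletic",
    "pose": "standing, hand on hip",
    "scene": "",                      # empty axes are skipped
    "quality": "photorealistic, high detail",
    "negative": "blurry, deformed",   # carried, never rendered
    "seed": 12345,
}
print(render(example_state))
```

Missing or blank axes simply drop out, so the agent can clear an axis without leaving a dangling separator in the prompt.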
Per‑iteration algorithm (greedy per‑axis hill‑climb)
state = initial_state
best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)
    report = run agent_bridge.py --prompt prompt --negative state.negative
                 --seed state.seed --run-tag iter{i}
                 --workflow wf.json --analysis-dir <report_dir>
    score = report.overall_score
    if score >= TARGET:                        # e.g. 0.85
        stop("converged", state, score)
    if score > best_score:
        best_score = score ; best_state = state ; stale = 0
    else:
        stale += 1
        state = best_state                     # revert: undo the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS:    # e.g. PATIENCE=4, MAX_ITERS=25
        stop("plateau/budget", best_state, best_score)
    # choose the next single edit:
    worst_axis = axis with lowest per-axis score in report.axes
    edit = map_fix_to_axis(report.fix_suggestions, worst_axis)   # apply the model's suggestion
    state = apply(best_state, worst_axis, edit)                  # change ONE axis only
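The loop above could be written in Python roughly as below. Treat it as a sketch under stated assumptions: the report is assumed to be a single JSON object on stdout with overall_score and an axes mapping of axis name to score; the "reports" directory, the evaluate and propose callbacks, and the toy stand-ins at the bottom are hypothetical and exist only so the loop runs without ComfyUI or the Judge. Only the agent_bridge.py flags shown in this document are used.

```python
import json
import subprocess

AXES = ["cast", "clothing", "pose", "scene", "composition", "expression", "color_light"]


def bridge_command(prompt, state, i, workflow="wf.json", report_dir="reports"):
    """Build argv for one agent_bridge.py run ("reports" is a placeholder path)."""
    return [
        "python", "agent_bridge.py",
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", state["negative"],
        "--seed", str(state["seed"]),
        "--run-tag", f"iter{i}",
        "--analysis-dir", report_dir,
    ]


def run_bridge(prompt, state, i):
    """Assumption: the bridge prints exactly one JSON report on stdout."""
    done = subprocess.run(bridge_command(prompt, state, i),
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def hill_climb(initial_state, evaluate, propose, target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb; returns (status, state, score)."""
    state = dict(initial_state)
    best_state, best_score, stale = dict(initial_state), -1.0, 0
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)
        score = report["overall_score"]
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_state, best_score, stale = dict(state), score, 0
        else:
            stale += 1                      # edit did not help; dropped below
        if stale >= patience:
            return "plateau", best_state, best_score
        worst_axis = min(AXES, key=lambda axis: report["axes"][axis])
        state = dict(best_state)            # always edit from the best state
        state[worst_axis] = propose(worst_axis, report, state)  # ONE axis only
    return "budget", best_state, best_score


# Toy stand-ins: an axis scores 1.0 when it matches the reference, else 0.0.
REFERENCE = {
    "cast": "1 woman", "clothing": "red dress", "pose": "standing",
    "scene": "bedroom", "composition": "full-body shot",
    "expression": "soft smile", "color_light": "warm light",
}


def toy_evaluate(state, i):
    axes = {axis: 1.0 if state[axis] == REFERENCE[axis] else 0.0 for axis in AXES}
    return {"overall_score": sum(axes.values()) / len(AXES), "axes": axes}


def toy_propose(axis, report, state):
    return REFERENCE[axis]


start = dict(REFERENCE, pose="seated", scene="kitchen", color_light="flat",
             negative="blurry", seed=12345)
status, final_state, final_score = hill_climb(start, toy_evaluate, toy_propose)
print(status, round(final_score, 3))
```

In a real run, evaluate would render the state and call run_bridge, and propose would turn the Judge's fix_suggestions into a new value for the chosen axis. Passing both in as callbacks keeps the control flow testable offline.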
Rules that matter
- Change one axis per iteration. One edit = clean attribution of the score delta. Only batch two edits when two axes score very low and are clearly independent.
- Freeze seed while searching axes. The score must reflect the prompt, not sampler noise. Vary the seed only after you've converged, to confirm robustness.
- Always edit from best_state, not the last (possibly worse) state. That's the "revert on no improvement" step, and it prevents drifting down a bad path.
- Target the lowest‑scoring axis first, applying the Judge's matching fix_suggestion. If a suggestion doesn't help after a try, pick an alternative value for that axis before moving on.
- Near the margin, don't over‑trust one reading. swap_eval already averages two orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare averages before committing.
- Detect gaming/oscillation. If scores bounce without net gain, reduce edit size (smaller, more specific wording changes) and re‑anchor on best_state.
- Log every step: (iter, axis_changed, old→new value, prompt, overall_score, per‑axis). The run must be auditable and resumable.
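The near‑margin rule can be illustrated with a small helper. This is only a sketch: the function names are hypothetical, and keeping the incumbent on an exact tie is an assumption that the rules above do not state.

```python
from statistics import mean

MARGIN = 0.03  # the "within ~0.03" threshold from the rule above


def needs_second_seed(score_a, score_b, margin=MARGIN):
    """True when two single-seed readings are too close to trust."""
    return abs(score_a - score_b) <= margin


def pick_candidate(scores_a, scores_b):
    """Compare per-seed averages; on a tie keep 'a', the incumbent (assumption)."""
    return "b" if mean(scores_b) > mean(scores_a) else "a"


# Candidate a is the current best; candidate b is the new edit.
first_a, first_b = 0.83, 0.81
if needs_second_seed(first_a, first_b):
    # Second-seed readings would come from re-running the bridge; made up here.
    choice = pick_candidate([first_a, 0.79], [first_b, 0.84])
else:
    choice = "a" if first_a >= first_b else "b"
print(choice)
```

The scores in the usage lines are invented for illustration. In practice each second reading costs one extra render and Judge call per candidate.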
Mapping fix_suggestions → axes
The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous, attribute it to the lowest‑scoring axis it plausibly affects.
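A keyword match of this kind might look like the sketch below. The keyword lists and the name axis_for_fix are hypothetical, chosen only to cover the example phrasings above; the Judge's real vocabulary may differ.

```python
import re

# Hypothetical keyword lists per axis; extend to match the Judge's wording.
AXIS_KEYWORDS = {
    "cast": ["cast", "woman", "man", "person", "age"],
    "clothing": ["clothing", "outfit", "lace", "dress"],
    "pose": ["pose", "standing", "seated", "sitting"],
    "scene": ["scene", "bedroom", "kitchen", "room", "background"],
    "composition": ["composition", "shot", "angle", "framing"],
    "expression": ["expression", "smile", "gaze"],
    "color_light": ["color_light", "lighting", "light", "warm", "color"],
}


def axis_for_fix(fix, axis_scores):
    """Return the axis a fix targets, or None if no keyword matches.

    When several axes match, pick the lowest-scoring one, as the text says.
    """
    text = fix.lower()
    hits = [
        axis for axis, keywords in AXIS_KEYWORDS.items()
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in keywords)
    ]
    if not hits:
        return None
    return min(hits, key=lambda axis: axis_scores.get(axis, 1.0))


scores = {"scene": 0.90, "pose": 0.35, "clothing": 0.78, "color_light": 0.50}
for fix in ["set pose=standing", "add lace trim to clothing",
            "warmer lighting", "warm bedroom lighting"]:
    print(fix, "->", axis_for_fix(fix, scores))
```

Whole-word matching keeps "man" from firing on "woman". The last fix matches both scene and color_light, so the lower-scoring axis wins.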
Worked example
iter1  prompt="1 woman, casual outfit, indoors, ..."   score=0.41
       axes: scene 0.30 (worst) - "ref bedroom, gen kitchen"
       fix:  "set scene to a dim bedroom"
iter2  edit scene→"dimly lit bedroom"                  score=0.58 (kept)
       axes: pose 0.35 (worst) - "ref standing, gen seated"
iter3  edit pose→"standing, hand on hip"               score=0.71 (kept)
       axes: color_light 0.50 (worst) - "ref warm, gen flat"
iter4  edit color_light→"warm rim light"               score=0.69 (worse → revert)
iter5  edit color_light→"warm golden hour glow"        score=0.83 (kept)
       axes: clothing 0.78 (worst) - "gen lacks lace detail"
iter6  edit clothing→"red lace lingerie with trim"     score=0.88 ≥ target → STOP
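The keep/revert decisions in this trace follow mechanically from the scores. The short sketch below replays them; the function name replay is hypothetical, and the baseline run (iter1) is labelled "kept" because it becomes the first best state.

```python
def replay(scores, target=0.85, patience=4):
    """Label each score as kept, revert, or stop under the hill-climb rules."""
    best, stale, decisions = -1.0, 0, []
    for score in scores:
        if score >= target:
            decisions.append("stop")       # converged
            break
        if score > best:
            best, stale = score, 0
            decisions.append("kept")
        else:
            stale += 1
            decisions.append("revert")     # fall back to the best state
            if stale >= patience:
                break                      # plateau
    return decisions


trace = [0.41, 0.58, 0.71, 0.69, 0.83, 0.88]   # overall scores from iter1..iter6
print(replay(trace))
```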
Agent system prompt (paste into your CLI agent)
You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes (cast, clothing, pose, scene, composition, expression, color_light) from 0–1.
You hold an axis state (JSON, keys above). Each turn you: (1) render the state to a prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→quality; (2) run
python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir <report_dir>
; (3) read the printed JSON report.
Then apply greedy per‑axis hill‑climb: keep the change only if overall_score improved, else revert to the best state; pick the lowest‑scoring axis and apply the Judge's matching fix_suggestion as a single edit. Keep the seed fixed while searching. Stop when overall_score ≥ TARGET (default 0.85), or after PATIENCE=4 non‑improving iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
Never change more than one axis at a time unless two axes are both very low and clearly independent. Never trust a single near‑margin reading: re‑run on a second seed when two candidates are within 0.03.