Calibration policy — the agent's playbook
The local Qwen3-VL judge only observes and scores — it does not propose fixes. The external agent (you / a stronger model) decides every correction. So the judge's job is to hand the agent the range of information needed to calibrate, and the agent's job is to turn that into prompt edits.
What the agent needs from each comparison (the information model)
To move a generated image toward a reference, for every dimension the prompt controls the agent needs three things:
| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the reference shows on this axis | the target: what to steer the prompt toward |
| `gen` | what the generated image shows | the current state: what to change |
| `verdict` | match / partial / mismatch | which axes to fix first (mismatch → partial → match) |
That's the whole signal: target, current, distance. The agent corrects by rewriting the
prompt so gen → ref on the mismatch (then partial) axes. The judge returns
{"verdict", "ref", "gen"} per axis. A discrete verdict is used because small VLMs give
unreliable 0–1 scores (identical ref/gen often scored 0.6) but classify match/partial/
mismatch reliably. overall_score and mismatch_count are computed from the verdicts on
our side (mean ordinal), so they're monotonic and trustworthy as a stop signal.
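As a minimal sketch of how the two summary numbers could be derived from the per-axis verdicts: the ordinal mapping (match = 1.0, partial = 0.5, mismatch = 0.0) and the function name `summarize` are assumptions for illustration, since the text above only says the score is a mean ordinal. The sample axis values are made up.

```python
from statistics import mean

# Assumed ordinal values; the document only states "mean ordinal".
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize(axes):
    """Return (overall_score, mismatch_count) for a dict of per-axis results."""
    verdicts = [entry["verdict"] for entry in axes.values()]
    overall_score = mean(VERDICT_VALUE[v] for v in verdicts)
    mismatch_count = sum(v == "mismatch" for v in verdicts)
    return overall_score, mismatch_count


# Illustrative input in the {"verdict", "ref", "gen"} shape the judge returns.
axes = {
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"},
    "lighting_color": {"verdict": "partial", "ref": "warm low-key", "gen": "warm, brighter"},
    "hair": {"verdict": "mismatch", "ref": "curly shoulder-length", "gen": "straight long"},
}
print(summarize(axes))  # (0.5, 1)
```

Because each verdict maps to a fixed value, fixing an axis can never lower the score, which is what makes it usable as a stop signal.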
The axes must span what the prompt can express — you can only fix what the prompt can say, and each diff must map to a lever. The default set (configurable on the node) is grouped below.
Axes (default set — edit axes on the node to taste)
- Identity / cast: `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- Body: `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
- Wardrobe: `clothing_state` (degree of undress + garments)
- Action / pose (where explicit content concentrates; kept granular): `sexual_act`, `position_name` (doggy/cowgirl/…), `body_orientation` (on top/from behind/…), `limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle), `contact_points`, `genital_visibility`, `pose` (torso/head lean)
- Affect: `facial_expression`, `gaze`
- Camera: `framing` (shot/crop), `camera_angle` (POV/angle)
- Render: `scene`, `lighting_color`, `art_style`
Each axis carries a one-line definition in the prompt (so e.g. gender_mix is a count,
not a position). Coarse axes blur the differences that matter for adult imagery; the act /
pose cluster is split into many axes so the agent gets specific, actionable targets.
Step 0 — first pass (describe / bootstrap)
The very first iteration has no generated image yet, so the judge runs in describe mode: it looks at the reference alone and emits one canonical scene description — a coherent, internally-consistent paragraph plus a per-axis target spec. That seeds everything and becomes the fixed reference for the whole loop:
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
--run-tag seed --analysis-dir <report_dir>
→ calib_seed.json = {"mode":"describe", "description":"…", "axes":{axis:value,…}, "canonical_reference":"…"}
The agent takes description as the initial prompt and axes as the initial
axis_state. Crucially, the compare loop then anchors on this canonical reference
(via --ref-desc-file) instead of re-reading the reference image every iteration — so the
ref side never drifts or contradicts itself across passes; only the generated image is
re-described each turn.
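The hand-off from the describe pass to the compare loop might look like the sketch below. It assumes the seed report is named `calib_seed.json` inside the report directory and has the shape shown above; the helper names `load_seed` and `compare_argv` are hypothetical, and only flags that appear in this document are used.

```python
import json
from pathlib import Path


def load_seed(report_dir):
    """Read calib_seed.json and return (initial_prompt, axis_state, canonical_reference)."""
    seed_path = Path(report_dir) / "calib_seed.json"
    seed = json.loads(seed_path.read_text(encoding="utf-8"))
    if seed.get("mode") != "describe":
        raise ValueError("expected a describe-mode report")
    return seed["description"], dict(seed["axes"]), seed["canonical_reference"]


def compare_argv(report_dir, workflow, prompt, negative, seed, iteration):
    """Build the agent_bridge.py command line for one compare pass."""
    return [
        "python", "agent_bridge.py",
        "--workflow", str(workflow),
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        # Anchor on the fixed canonical reference instead of re-reading the image.
        "--ref-desc-file", str(Path(report_dir) / "calib_seed.json"),
        "--analysis-dir", str(report_dir),
    ]
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)`, so the prompt text needs no shell quoting.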
Per-iteration algorithm (greedy per-axis hill-climb)
best = -1 ; best_state = initial_state ; state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)                        # state = current value per axis
    report = run agent_bridge.py --prompt prompt --negative state.negative
                 --seed state.seed --run-tag iter{i}
                 --ref-desc-file <report_dir>/calib_seed.json   # anchor on canonical ref
                 --workflow wf.json --analysis-dir <report_dir>
    if report.mismatch_count == 0 and report.overall_score >= TARGET:
        stop("converged", state)                  # TARGET e.g. 0.9 (mostly match)
    if report.overall_score > best:
        best = report.overall_score ; best_state = state ; stale = 0
    else:
        stale += 1 ; state = best_state           # revert the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
    worst = a `mismatch` axis (else a `partial` axis) from report.axes
    target_value = report.axes[worst].ref         # what the reference shows
    state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
edit_toward(ref) is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. gen:[missionary] → ref:[doggy style] ⇒ set the position
phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
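The loop above can be sketched as runnable Python under a few assumptions: `evaluate` is a hypothetical callable standing in for one `agent_bridge.py` compare pass and returns a report dict, `edit_toward` is the agent's own wording step, and the state is a plain dict of axis values. The fake judge and `REFERENCE` values below exist only to exercise the loop.

```python
def hill_climb(initial_state, evaluate, edit_toward,
               target=0.9, patience=4, max_iters=25):
    """Greedy per-axis hill-climb. Returns (reason, state, score)."""
    best_score, best_state, stale = -1.0, dict(initial_state), 0
    state = dict(initial_state)
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)                 # one compare pass
        score = report["overall_score"]
        if report["mismatch_count"] == 0 and score >= target:
            return "converged", state, score
        if score > best_score:
            best_score, best_state, stale = score, dict(state), 0
        else:
            stale += 1                              # next edit restarts from best_state
        if stale >= patience:
            return "plateau", best_state, best_score
        axes = report["axes"]
        # First mismatch axis, else first partial axis.
        worst = next((name for verdict in ("mismatch", "partial")
                      for name, entry in axes.items()
                      if entry["verdict"] == verdict), None)
        if worst is None:
            return "plateau", best_state, best_score
        state = dict(best_state)                    # always edit from the best state
        state[worst] = edit_toward(worst, axes[worst]["ref"])  # change ONE axis
    return "budget", best_state, best_score


# Hypothetical stand-in for the judge: an axis matches only when equal to the target.
REFERENCE = {"scene": "dim bedroom", "lighting_color": "warm low-key",
             "hair": "curly shoulder-length"}


def fake_evaluate(state, iteration):
    axes = {name: {"verdict": "match" if state.get(name) == ref else "mismatch",
                   "ref": ref, "gen": state.get(name, "")}
            for name, ref in REFERENCE.items()}
    matches = sum(entry["verdict"] == "match" for entry in axes.values())
    return {"overall_score": matches / len(axes),
            "mismatch_count": len(axes) - matches, "axes": axes}


start = {"scene": "bright kitchen", "lighting_color": "flat daylight",
         "hair": "straight long"}
reason, final_state, final_score = hill_climb(start, fake_evaluate,
                                              lambda axis, ref: ref)
print(reason, final_score)  # converged 1.0
```

In real use `edit_toward` is where the agent's judgment lives. The lambda that copies the reference value verbatim is only a placeholder.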
Rules that matter
- Change one axis per iteration: clean attribution of the delta. Batch two only when both are `mismatch` and clearly independent.
- Freeze `seed` while searching: the score must reflect the prompt, not sampler noise. Vary the seed only after converging, to confirm robustness.
- Always edit from `best_state`, never from a worse last state.
- Prioritize `mismatch` axes, then `partial`. Steer toward `ref`; if the obvious wording doesn't flip the verdict, try an alternative phrasing before moving on.
- Trust the verdict + the ref/gen text, not fine score deltas. The overall score is a coarse mean; use `mismatch_count` falling as the real progress signal.
- Log every step: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.
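One way to keep the step log from the last rule is an append-only CSV. This is a sketch only: the file format, the column split of `old→new` into two fields, and the helper name `log_step` are assumptions, not something the tooling prescribes.

```python
import csv
from pathlib import Path

FIELDS = ["iter", "axis_changed", "old", "new", "overall_score", "mismatch_count"]


def log_step(path, **step):
    """Append one step to a CSV log, writing the header when the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(step)  # raises ValueError on keys outside FIELDS
```

A call such as `log_step("steps.csv", iter=1, axis_changed="scene", old="bright kitchen", new="dimly lit bedroom", overall_score=0.55, mismatch_count=6)` records the first step of the worked example.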
Worked example
iter1 overall=0.55 mism=6 worst: scene MISMATCH ref:[dim bedroom] gen:[bright kitchen]
edit scene → "dimly lit bedroom"
iter2 overall=0.63 mism=5 worst: position_name MISMATCH ref:[doggy style] gen:[cowgirl]
edit position → "doggy style, from behind"
iter3 overall=0.71 mism=3 worst: lighting_color MISMATCH ref:[warm low-key] gen:[flat daylight]
edit lighting → "warm low-key lighting" (mism=4 → revert)
iter4 retry lighting → "warm golden low-key glow" (mism=2 → keep, overall=0.82)
iter5 overall=0.88 mism=1 worst: hair PARTIAL ref:[curly shoulder-length] gen:[straight long]
edit hair → "curly shoulder-length brown hair"
iter6 overall=0.93 mism=0 ≥ target → STOP
Report shape the agent reads (latest.json / stdout)
{
  "run_tag": "iter002",
  "overall_score": 0.63,
  "mismatch_count": 5,
  "axes": {
    "position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
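Reading a report of this shape and turning it into an ordered to-do list could be sketched as follows. The helper name `fix_queue` and the sample report are illustrative assumptions; only the `axes` entries with `verdict`, `ref`, and `gen` come from the shape above.

```python
import json

# Lower number = fix sooner; "match" axes need no edit and are skipped.
PRIORITY = {"mismatch": 0, "partial": 1}


def fix_queue(report_text):
    """Return [(axis, ref, gen, verdict), ...] with mismatches before partials."""
    report = json.loads(report_text)
    todo = [(name, entry["ref"], entry["gen"], entry["verdict"])
            for name, entry in report["axes"].items()
            if entry["verdict"] in PRIORITY]
    todo.sort(key=lambda item: PRIORITY[item[3]])  # stable: keeps report order within a tier
    return todo


# Made-up report text for illustration.
sample = json.dumps({
    "run_tag": "iter002", "overall_score": 0.5, "mismatch_count": 1,
    "axes": {
        "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"},
        "hair": {"verdict": "partial", "ref": "curly shoulder-length",
                 "gen": "wavy shoulder-length"},
        "lighting_color": {"verdict": "mismatch", "ref": "warm low-key",
                           "gen": "flat daylight"},
    },
})
for axis, ref, gen, verdict in fix_queue(sample):
    print(f"{verdict:9} {axis}: gen [{gen}] -> ref [{ref}]")
```

The first entry of the queue is the axis to edit this turn; the rest are context for later turns.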
Agent system prompt (paste into your CLI agent)
You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference, measured by a Qwen3-VL judge that compares ~24 axes (identity, body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict` (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the generated image shows). `overall_score` and `mismatch_count` are computed from the verdicts.

You hold an axis state (current value per axis). Each turn: (1) render it to a prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`; (3) read the printed JSON.

Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else revert to the best state; pick a mismatch axis (else a partial axis) and rewrite that axis's prompt wording to match its `ref` value (you decide the wording; there are no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.