The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score): the target/current/distance an agent needs to calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/camera/render) so the action cluster stays discriminative for explicit content. swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Calibration policy — the agent's playbook
The local Qwen3-VL judge only observes and scores — it does not propose fixes. The external agent (you / a stronger model) decides every correction. So the judge's job is to hand the agent the range of information needed to calibrate, and the agent's job is to turn that into prompt edits.
What the agent needs from each comparison (the information model)
To move a generated image toward a reference, the agent needs three things for every dimension the prompt controls:
| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the reference shows on this axis | the target: what to steer the prompt toward |
| `gen` | what the generated image shows | the current state: what to change |
| `score` | 0–1 closeness | the gap / priority: which axes to fix first |
That's the whole signal: target, current, distance. The agent corrects by rewriting the
prompt so gen → ref on the lowest-scoring axes. The judge returns exactly this per axis
({"score", "ref", "gen"}) plus a top-level overall_score.
The axes must span what the prompt can express — you can only fix what the prompt can say, and each diff must map to a lever. The default set (configurable on the node) is grouped below.
Axes (default set — edit axes on the node to taste)
- Identity / cast: `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- Body: `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
- Wardrobe: `clothing_state` (degree of undress + garments)
- Action (where explicit content concentrates): `sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`
- Affect: `pose`, `facial_expression`, `gaze`
- Camera: `framing` (shot/crop), `camera_angle` (POV/angle)
- Render: `scene`, `lighting_color`, `art_style`
Coarse axes blur the differences that matter for adult imagery; this set keeps the act / interaction cluster granular so the agent gets actionable targets.
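One way to see which cluster is dragging the match down is to average the per-axis scores by group. The sketch below is illustrative only: `CLUSTERS` covers just a subset of the default axes, `cluster_means` is a hypothetical helper (not part of the node), and the sample scores are made up apart from the `scene` and `lighting_color` values borrowed from the worked example.

```python
from statistics import mean

# Subset of the default axes, grouped as in the list above (illustrative only).
CLUSTERS = {
    "affect": ["pose", "facial_expression", "gaze"],
    "camera": ["framing", "camera_angle"],
    "render": ["scene", "lighting_color", "art_style"],
}


def cluster_means(axes: dict) -> dict:
    """Mean score per cluster, skipping axes the report does not contain."""
    out = {}
    for name, members in CLUSTERS.items():
        scores = [axes[a]["score"] for a in members if a in axes]
        if scores:  # omit clusters with no scored axis
            out[name] = round(mean(scores), 3)
    return out


# Sample per-axis readings in the {"score", "ref", "gen"} shape (framing is made up).
example_axes = {
    "scene": {"score": 0.30, "ref": "dim bedroom", "gen": "bright kitchen"},
    "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"},
    "framing": {"score": 0.90, "ref": "medium shot", "gen": "medium shot"},
}
print(cluster_means(example_axes))
```

The per-axis scores remain the signal the agent edits against; a cluster mean is only a quick triage view.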
Per-iteration algorithm (greedy per-axis hill-climb)
    state = initial_state ; best_state = state ; best_score = -1 ; stale = 0 ; i = 0
    loop:
        i += 1
        prompt = render(state)                      # state = current value per axis
        report = run agent_bridge.py --prompt prompt --negative state.negative
                     --seed state.seed --run-tag iter{i}
                     --workflow wf.json --analysis-dir <report_dir>
        if report.overall_score >= TARGET: stop("converged", state)   # e.g. 0.85
        if report.overall_score > best_score:
            best_score = report.overall_score ; best_state = state ; stale = 0
        else:
            stale += 1 ; state = best_state         # revert the change that didn't help
        if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
        worst = axis with the lowest report.axes[*].score
        target_value = report.axes[worst].ref       # what the reference shows
        state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
edit_toward(ref) is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. gen:[missionary] → ref:[doggy style] ⇒ set the position
phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
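The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `evaluate` stands in for "render the state, run the bridge, read the report", `edit_toward` stands in for the agent's own wording decision, and `fake_evaluate` is a toy judge invented here so the loop can be exercised without a model.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, str]  # current prompt wording per axis


def hill_climb(
    initial_state: State,
    evaluate: Callable[[State], dict],       # hypothetical: render + run + read report
    edit_toward: Callable[[str, str], str],  # hypothetical: the agent's wording choice
    target: float = 0.85,
    patience: int = 4,
    max_iters: int = 25,
) -> Tuple[str, State, float]:
    best_score, best_state = -1.0, dict(initial_state)
    state, stale = dict(initial_state), 0
    for i in range(1, max_iters + 1):
        report = evaluate(state)
        score = report["overall_score"]
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_score, best_state, stale = score, dict(state), 0
        else:
            stale += 1  # the last edit did not help; the next edit restarts from best_state
        if stale >= patience or i >= max_iters:
            return "plateau/budget", best_state, best_score
        axes = report["axes"]
        worst = min(axes, key=lambda a: axes[a]["score"])
        state = dict(best_state)  # always edit from the best state
        state[worst] = edit_toward(worst, axes[worst]["ref"])  # change ONE axis
    return "plateau/budget", best_state, best_score


# Toy judge (an assumption for demonstration): an axis scores 1.0 on exact match, else 0.2.
REFERENCE = {"scene": "dim bedroom", "lighting_color": "warm low-key"}


def fake_evaluate(state: State) -> dict:
    axes = {
        a: {"score": 1.0 if state.get(a) == ref else 0.2, "ref": ref, "gen": state.get(a, "")}
        for a, ref in REFERENCE.items()
    }
    overall = sum(v["score"] for v in axes.values()) / len(axes)
    return {"overall_score": overall, "axes": axes}


status, state, score = hill_climb(
    {"scene": "bright kitchen", "lighting_color": "flat daylight"},
    fake_evaluate,
    lambda axis, ref: ref,  # simplest possible edit: copy the ref wording verbatim
)
print(status, score, state)
```

With a real judge, `edit_toward` is where the agent's reasoning lives; copying `ref` verbatim is only the simplest starting point.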
Rules that matter
- Change one axis per iteration, so the score delta can be attributed cleanly. Batch two only when both are very low and clearly independent.
- Freeze `seed` while searching: the score must reflect the prompt, not sampler noise. Vary the seed only after converging, to confirm robustness.
- Always edit from `best_state`, never from a worse last state.
- Steer toward `ref` on the worst axis; if the obvious wording doesn't move the score after a try, try an alternative phrasing for that axis before moving on.
- Near the margin, don't over-trust one reading. `swap_eval` already averages two orderings; if two candidates are within ~0.03, re-run each on a second seed.
- Log every step: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
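Two of these rules are mechanical enough to sketch in code: the near-margin check and the per-step log record. The helper names `needs_second_seed` and `log_step`, and the JSON-lines log format, are assumptions made for illustration; only the ~0.03 margin and the logged fields come from the rules above.

```python
import json

MARGIN = 0.03  # the "~0.03" threshold from the rule above


def needs_second_seed(score_a: float, score_b: float, margin: float = MARGIN) -> bool:
    """True when two candidates are too close to rank from a single reading."""
    return abs(score_a - score_b) <= margin


def log_step(iteration, axis_changed, old, new, overall_score, axes, n_worst=3) -> str:
    """One JSON line per step: (iter, axis_changed, old→new, overall_score, worst-axes)."""
    worst = sorted(axes, key=lambda a: axes[a]["score"])[:n_worst]
    return json.dumps(
        {
            "iter": iteration,
            "axis_changed": axis_changed,
            "old": old,
            "new": new,
            "overall_score": overall_score,
            "worst_axes": worst,
        },
        ensure_ascii=False,
    )


# Values borrowed from the worked example; the axes dict is trimmed to two entries.
line = log_step(
    3, "lighting_color", "flat daylight", "warm low-key lighting", 0.69,
    {
        "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"},
        "scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"},
    },
)
print(line)
```

Appending one such line per iteration to a file gives a replayable trace of which edit produced which score change.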
Worked example
    iter1  overall=0.41  worst: scene 0.30            ref:[dim bedroom]       gen:[bright kitchen]
           edit scene → "dimly lit bedroom"
    iter2  overall=0.58  worst: position 0.35         ref:[doggy style]       gen:[missionary]
           edit position → "doggy style"
    iter3  overall=0.71  worst: lighting_color 0.50   ref:[warm low-key]      gen:[flat daylight]
           edit lighting → "warm low-key lighting"          (0.69 → revert)
    iter4  overall=0.69  retry lighting → "warm golden low-key glow"   (0.84 → keep)
    iter5  overall=0.84  worst: clothing_state 0.80   ref:[red lace lingerie] gen:[plain bra]
           edit clothing → "red lace lingerie"
    iter6  overall=0.89  ≥ target → STOP
Report shape the agent reads (latest.json / stdout)
    {
      "run_tag": "iter002",
      "overall_score": 0.58,
      "axes": {
        "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
        "scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
      },
      "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
    }
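A report of this shape can be read with the standard `json` module and ranked worst-first, which is the order the agent works through. In this sketch `worst_first` is a hypothetical helper, and the inline report is a trimmed stand-in that reuses axis values from the examples in this document rather than real judge output.

```python
import json


def worst_first(report: dict) -> list:
    """Return (axis, score, ref, gen) tuples sorted by ascending score."""
    rows = [(name, a["score"], a["ref"], a["gen"]) for name, a in report["axes"].items()]
    return sorted(rows, key=lambda r: r[1])


# Trimmed stand-in for a report; only the fields used below are included.
raw = """
{
  "run_tag": "iter002",
  "overall_score": 0.58,
  "axes": {
    "scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"},
    "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"}
  }
}
"""
report = json.loads(raw)
rows = worst_first(report)
for axis, score, ref, gen in rows:
    print(f"{axis:16s} {score:.2f}  ref:[{ref}]  gen:[{gen}]")
```

The first row is the axis to edit next; its `ref` value is the wording target.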
Agent system prompt (paste into your CLI agent)
You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity, body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1 closeness), `ref` (what the reference shows) and `gen` (what the generated image shows).

You hold an axis state (current value per axis). Each turn: (1) render it to a prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`; (3) read the printed JSON.

Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else revert to the best state; pick the lowest-scoring axis and rewrite that axis's prompt wording to match its `ref` value (you decide the wording; there are no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step and report the best prompt + score.
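For an agent that drives the bridge from a script rather than a shell, step (2) can be sketched as an argument list passed to `subprocess.run`. The flags are the ones quoted in the prompt above; everything else is an assumption: the helper names `bridge_argv` and `run_bridge`, the sample values (`wf.json`, seed 42, `reports`), and the idea that stdout contains only the JSON report.

```python
import json
import subprocess


def bridge_argv(workflow, prompt, negative, seed, iteration, analysis_dir):
    """Build the agent_bridge.py command as a list (no shell quoting needed)."""
    return [
        "python", "agent_bridge.py",
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", analysis_dir,
    ]


def run_bridge(argv):
    """Run the bridge and parse its output.

    Assumption: the script prints exactly one JSON document to stdout.
    """
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


# Hypothetical values for illustration; run_bridge is defined but not called here.
argv = bridge_argv("wf.json", "dimly lit bedroom", "", 42, 3, "reports")
print(argv)
```

Passing a list instead of a shell string keeps prompts containing quotes or commas intact.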