Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger
external agent. Each axis reports the target value (ref), the current value (gen)
and the closeness (score) — the target/current/distance an agent needs to
calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/
camera/render) so the action cluster stays discriminative for explicit content.
swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first;
default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:52:40 +02:00
parent aa3983d94a
commit 959ec70065
6 changed files with 188 additions and 164 deletions
+3 -3
@@ -34,7 +34,7 @@ can act on it.
| `generated_image` | IMAGE | — | the candidate to score | | `generated_image` | IMAGE | — | the candidate to score |
| `model_path` | STRING | `/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16` | local dir, **HF repo id** (`org/name`), or alias (`30b-a3b` / `8b` / `4b`) | | `model_path` | STRING | `/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16` | local dir, **HF repo id** (`org/name`), or alias (`30b-a3b` / `8b` / `4b`) |
| `precision` | bf16 / fp16 / fp8 / nf4 | bf16 | `nf4` = 4-bit (run the 30B judge on 32 GB); `fp8` with the `hf_fp8` copy | | `precision` | bf16 / fp16 / fp8 / nf4 | bf16 | `nf4` = 4-bit (run the 30B judge on 32 GB); `fp8` with the `hf_fp8` copy |
| `axes` | STRING | cast, clothing, pose, scene, composition, expression, color_light | scored axes (match your Prompt-Builder knobs) | | `axes` | STRING | ~20 axes (identity, body, wardrobe, action, affect, camera, render) | scored axes; granular for explicit content. Edit to taste |
| `max_new_tokens` | INT | 512 | | | `max_new_tokens` | INT | 512 | |
| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable | | `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias | | `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
@@ -51,8 +51,8 @@ default skip download entirely.
| name | type | use | | name | type | use |
|---|---|---| |---|---|---|
| `overall_score` | FLOAT 0..1 | loop stop-condition / objective | | `overall_score` | FLOAT 0..1 | loop stop-condition / objective |
| `axis_scores_json` | STRING (JSON) | per-axis `{score, diff}` for the controller | | `axis_scores_json` | STRING (JSON) | per-axis `{score, ref, gen}` — target vs current, for the agent |
| `diff_analysis` | STRING | human/controller-readable summary + fix suggestions | | `diff_analysis` | STRING | readable summary, worst axes first (`score ref:[…] gen:[…]`) |
| `raw` | STRING | raw model output (both passes if `swap_eval`) | | `raw` | STRING | raw model output (both passes if `swap_eval`) |
## Install ## Install
+1 -1
@@ -19,7 +19,7 @@ Stdlib only — no third-party deps, so any agent can shell out to it.
Loop, from the agent's side: Loop, from the agent's side:
1. build a prompt (calibrate from the previous analysis) 1. build a prompt (calibrate from the previous analysis)
2. run this script -> capture stdout (the analysis JSON) 2. run this script -> capture stdout (the analysis JSON)
3. read overall_score + per-axis diffs + fix_suggestions 3. read overall_score + per-axis {score, ref, gen}
4. adjust the prompt and go to 1, until overall_score >= target 4. adjust the prompt and go to 1, until overall_score >= target
""" """
+11 -12
@@ -18,8 +18,8 @@ reads the analysis, calibrates the prompt generator, and queues the next iterati
│ writes calib_<tag>.json + latest.json │ writes calib_<tag>.json + latest.json
3. poll /history/{id} (bridge does this) ◄───────────┘ 3. poll /history/{id} (bridge does this) ◄───────────┘
4. read report JSON (overall_score, 4. read report JSON (overall_score,
per-axis diffs, fix_suggestions) per-axis score + ref/gen values)
5. adjust Prompt-Builder knobs / prompt 5. steer prompt toward ref on worst axes
└──► go to 1 until overall_score ≥ target └──► go to 1 until overall_score ≥ target
``` ```
@@ -60,23 +60,22 @@ Stdout (captured by the agent) is the report:
"run_tag": "iter003", "run_tag": "iter003",
"overall_score": 0.62, "overall_score": 0.62,
"axes": { "axes": {
"pose": {"score": 0.40, "diff": "ref standing, gen seated"}, "position": {"score": 0.40, "ref": "doggy style", "gen": "missionary"},
"clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"} "clothing_state": {"score": 0.85, "ref": "red lace lingerie", "gen": "plain bra"}
}, },
"fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"], "prompt_used": "...",
"prompt_used": "1 woman, red lingerie, ...",
"_prompt_id": "…", "_report_path": "…/calib_iter003.json" "_prompt_id": "…", "_report_path": "…/calib_iter003.json"
} }
``` ```
## Agent calibration policy (suggested) ## Agent calibration policy (suggested)
The agent maps the lowest-scoring axes onto Prompt-Builder knobs and applies the For the lowest-scoring axes, the agent rewrites that axis's prompt wording to match its
`fix_suggestions`, regenerates, and keeps changes that raise `overall_score` `ref` value (the target), regenerates, and keeps changes that raise `overall_score`
(greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so (greedy per-axis hill-climb). The local model supplies no fixes — the agent owns the
the score reflects the prompt, not sampler noise; vary the seed only once you're near the correction. Keep the **T2I seed fixed** while searching so the score reflects the prompt,
target. Stop at `overall_score ≥ target` (e.g. 0.85) or a max-iteration budget. Log every not sampler noise; vary the seed only once near target. Stop at `overall_score ≥ target`
`(prompt, knobs, score)` so the search is auditable/resumable. (e.g. 0.85) or a max-iteration budget. Full policy: **[CALIBRATION_POLICY.md](CALIBRATION_POLICY.md)**.
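A minimal sketch of how an agent might read the report and rank the axes worst-first before choosing an edit. The helper name `worst_axes` and the sample values are illustrative and not part of this commit; only the report keys (`overall_score`, `axes`, `score`, `ref`, `gen`) are taken from the report shape shown above.

```python
import json


def worst_axes(report: dict, k: int = 3) -> list[tuple[str, float, str, str]]:
    """Return up to k (axis, score, ref, gen) tuples, lowest score first."""
    axes = report.get("axes", {})
    ranked = sorted(axes.items(), key=lambda kv: float(kv[1].get("score", 0)))
    return [
        (name, float(info.get("score", 0)), info.get("ref", ""), info.get("gen", ""))
        for name, info in ranked[:k]
    ]


# Hypothetical report, shaped like the stdout JSON described above.
report = json.loads("""
{
  "overall_score": 0.62,
  "axes": {
    "scene":          {"score": 0.92, "ref": "dim bedroom",  "gen": "dim bedroom"},
    "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"},
    "framing":        {"score": 0.60, "ref": "full body",    "gen": "close-up"}
  }
}
""")

for name, score, ref, gen in worst_axes(report, k=2):
    # Same line layout as the node's diff_analysis summary.
    print(f"- {name}: {score:.2f} ref:[{ref}] gen:[{gen}]")
```

The `ref` value of the first tuple is the target the agent would steer that axis's prompt wording toward.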
## Setup checklist ## Setup checklist
+90 -99
@@ -1,135 +1,126 @@
# Calibration policy — the agent's playbook # Calibration policy — the agent's playbook
This is the instruction set the **external CLI agent** (the controller) follows each The local Qwen3-VL judge only **observes and scores** — it does not propose fixes. The
iteration. Paste the "Agent system prompt" block into your agent, give it the workflow **external agent** (you / a stronger model) decides every correction. So the judge's job
path + reference image + target score, and let it loop. is to hand the agent the *range of information needed to calibrate*, and the agent's job
is to turn that into prompt edits.
The agent calibrates by reasoning over the **PromptBuilder axes** and editing a ## What the agent needs from each comparison (the information model)
structured *axis state*, then **rendering that state to a prompt string** that it injects
into the `CalibratorPromptReceptor`. This keeps the reasoning axis-aware while staying
compatible with the flat-string receptor. (If you later switch the receptor to carry a
structured config, the same axis state maps straight onto PromptBuilder's split control
nodes.)
--- To move a generated image toward a reference, for **every dimension the prompt controls**
the agent needs three things:
## Axis state (the agent's working memory) | field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state — what to change |
| `score` | 0-1 closeness | the **gap / priority** — which axes to fix first |
```json That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
{ prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
"cast": "1 woman, mid-20s, athletic", (`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
"clothing": "red lace lingerie",
"pose": "standing, hand on hip",
"scene": "dimly lit bedroom",
"composition": "full-body shot, slight low angle",
"expression": "soft smile, eye contact",
"color_light": "warm rim light, shallow depth of field",
"quality": "photorealistic, high detail",
"negative": "blurry, deformed, lowres, extra limbs",
"seed": 12345
}
```
These keys are exactly the Judge's scoring axes. `quality`/`negative`/`seed` are carried The axes must **span what the prompt can express** — you can only fix what the prompt can
but not scored. Render order (subject → wardrobe → action → setting → framing → affect → say, and each diff must map to a lever. The default set (configurable on the node) is
light → quality): grouped below.
``` ## Axes (default set — edit `axes` on the node to taste)
prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
```
--- - **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action (where explicit content concentrates):** `sexual_act`, `position`,
`penetration`, `explicitness`, `body_contact`
- **Affect:** `pose`, `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`
## Per-iteration algorithm (greedy per-axis hill-climb) Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
interaction cluster granular so the agent gets actionable targets.
## Per-iteration algorithm (greedy per-axis hill-climb)
``` ```
best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0 best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
loop: loop:
i += 1 i += 1
prompt = render(state) prompt = render(state) # state = current value per axis
report = run agent_bridge.py --prompt prompt --negative state.negative report = run agent_bridge.py --prompt prompt --negative state.negative
--seed state.seed --run-tag iter{i} --seed state.seed --run-tag iter{i}
--workflow wf.json --analysis-dir <report_dir> --workflow wf.json --analysis-dir <report_dir>
score = report.overall_score if report.overall_score >= TARGET: stop("converged", state) # e.g. 0.85
if score >= TARGET: # e.g. 0.85 if report.overall_score > best_score:
stop("converged", state, score) best_score = report.overall_score ; best_state = state ; stale = 0
if score > best_score:
best_score = score ; best_state = state ; stale = 0
else: else:
stale += 1 stale += 1 ; state = best_state # revert the change that didn't help
state = best_state # revert: undo the change that didn't help if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
if stale >= PATIENCE or i >= MAX_ITERS: # e.g. PATIENCE=4, MAX_ITERS=25
stop("plateau/budget", best_state, best_score)
# choose the next single edit: worst = axis with the lowest report.axes[*].score
worst_axis = axis with lowest per-axis score in report.axes target_value = report.axes[worst].ref # what the reference shows
edit = map_fix_to_axis(report.fix_suggestions, worst_axis) # apply the model's suggestion state = apply(best_state, worst, edit_toward(target_value)) # change ONE axis
state = apply(best_state, worst_axis, edit) # change ONE axis only
``` ```
`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. `gen:[missionary] → ref:[doggy style]` ⇒ set the position
phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
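The loop above can be sketched as runnable Python under some loud assumptions: `judge` stands in for the whole "run `agent_bridge.py` and parse stdout" step, `edit_toward` is reduced to copying the `ref` string into the state, and the `tried` set (not in the pseudocode) keeps the sketch from repeating an edit that already failed from the current best state. A real agent would write its own wording for each axis instead.

```python
from typing import Callable

State = dict[str, str]  # current prompt wording per axis


def hill_climb(initial_state: State,
               judge: Callable[[State], dict],
               target: float = 0.85,
               patience: int = 4,
               max_iters: int = 25) -> tuple[str, State, float]:
    """Greedy per-axis hill-climb; returns (reason, best_state, best_score)."""
    best_state, best_score, best_report = dict(initial_state), -1.0, None
    state, stale, tried = dict(initial_state), 0, set()
    for _ in range(max_iters):
        report = judge(state)
        score = float(report["overall_score"])
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_state, best_score, best_report = dict(state), score, report
            stale, tried = 0, set()
        else:
            stale += 1  # the last edit did not help; it is dropped below
            if stale >= patience:
                return "plateau", best_state, best_score
        # Always edit from the best state, using the report that produced it.
        axes = best_report["axes"]
        candidates = [a for a in axes if a not in tried]
        if not candidates:
            return "plateau", best_state, best_score
        worst = min(candidates, key=lambda a: float(axes[a]["score"]))
        tried.add(worst)
        state = dict(best_state)
        state[worst] = axes[worst]["ref"]  # stand-in for edit_toward(ref)
    return "budget", best_state, best_score


# Toy judge (hypothetical): an axis scores 1.0 when its wording equals the target.
TARGET_IMAGE = {"scene": "dim bedroom", "framing": "full body",
                "lighting_color": "warm low-key"}


def toy_judge(state: State) -> dict:
    axes = {a: {"score": 1.0 if state.get(a) == ref else 0.0,
                "ref": ref, "gen": state.get(a, "")}
            for a, ref in TARGET_IMAGE.items()}
    overall = sum(v["score"] for v in axes.values()) / len(axes)
    return {"overall_score": overall, "axes": axes}


start = {"scene": "bright kitchen", "framing": "close-up",
         "lighting_color": "flat daylight"}
reason, best, score = hill_climb(start, toy_judge)
print(reason, score, best)
```

With the toy judge every edit helps, so the run converges after one edit per axis; with a real judge the revert and `patience` branches are what keep a bad edit from sticking.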
### Rules that matter ### Rules that matter
1. **Change one axis per iteration.** One edit = clean attribution of the score delta. 1. **Change one axis per iteration**: clean attribution of the score delta. Batch two
Only batch two edits when two axes score very low *and* are clearly independent. only when both are very low and clearly independent.
2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not 2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
sampler noise. Vary the seed only after you've converged, to confirm robustness. noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`, not the last (possibly worse) state** — that's the 3. **Always edit from `best_state`**, never from a worse last state.
"revert on no improvement" step. Prevents drifting down a bad path. 4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the score
4. **Target the lowest-scoring axis first**, applying the Judge's matching after a try, try an alternative phrasing for that axis before moving on.
`fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value 5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
for that axis before moving on. orderings; if two candidates are within ~0.03, re-run each on a second seed.
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two 6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
orderings; if two candidates are within ~0.03, re-run each on a second seed and compare
averages before committing.
6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size
(smaller, more specific wording changes) and re-anchor on `best_state`.
7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per-axis)`.
The run must be auditable and resumable.
### Mapping `fix_suggestions` → axes
The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to
clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous,
attribute it to the lowest-scoring axis it plausibly affects.
---
## Worked example ## Worked example
``` ```
iter1 prompt="1 woman, casual outfit, indoors, ..." score=0.41 iter1 overall=0.41 worst: scene 0.30 ref:[dim bedroom] gen:[bright kitchen]
axes: scene 0.30 (worst) — "ref bedroom, gen kitchen" edit scene → "dimly lit bedroom"
fix: "set scene to a dim bedroom" iter2 overall=0.58 worst: position 0.35 ref:[doggy style] gen:[missionary]
iter2 edit scene→"dimly lit bedroom" score=0.58 (kept) edit position → "doggy style"
axes: pose 0.35 (worst) — "ref standing, gen seated" iter3 overall=0.71 worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
iter3 edit pose→"standing, hand on hip" score=0.71 (kept) edit lighting → "warm low-key lighting" (0.69 → revert)
axes: color_light 0.50 (worst) — "ref warm, gen flat" iter4 overall=0.69 retry lighting → "warm golden low-key glow" (0.84 → keep)
iter4 edit color_light→"warm rim light" score=0.69 (worse → revert) iter5 overall=0.84 worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
iter5 edit color_light→"warm golden hour glow" score=0.83 (kept) edit clothing → "red lace lingerie"
axes: clothing 0.78 (worst) — "gen lacks lace detail" iter6 overall=0.89 ≥ target → STOP
iter6 edit clothing→"red lace lingerie with trim" score=0.88 ≥ target → STOP
``` ```
--- ## Report shape the agent reads (`latest.json` / stdout)
```json
{
"run_tag": "iter002",
"overall_score": 0.58,
"axes": {
"position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
"scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
},
"prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
```
## Agent system prompt (paste into your CLI agent) ## Agent system prompt (paste into your CLI agent)
> You are the controller for a local image prompt calibrator. Goal: make a generated > You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference image, measured by a Qwen3-VL judge that scores 7 axes > image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
> (cast, clothing, pose, scene, composition, expression, color_light) from 0-1. > body, wardrobe, action, affect, camera, render) and for each returns `score` (0-1
> closeness), `ref` (what the reference shows) and `gen` (what the generated shows).
> >
> You hold an **axis state** (JSON, keys above). Each turn you: (1) render the state to a > You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→ > prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> quality; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>" > --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`;
> --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir > (3) read the printed JSON.
> <report_dir>`; (3) read the printed JSON report.
> >
> Then apply greedy per-axis hill-climb: keep the change only if `overall_score` improved, > Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> else revert to the best state; pick the **lowest-scoring axis** and apply the Judge's > revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching. > wording to match its `ref` value (you decide the wording — there are no machine-supplied
> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non-improving > `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score. > `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
> > Log every step and report the best prompt + score.
> Never change more than one axis at a time unless two axes are both very low and clearly
> independent. Never trust a single near-margin reading — re-run on a second seed when two
> candidates are within 0.03.
+25 -25
@@ -28,12 +28,12 @@
│ Qwen3-VL JUDGE node ── the "vllm node" │ │ Qwen3-VL JUDGE node ── the "vllm node" │
│ in : reference + generated │ │ in : reference + generated │
│ out: overall_score 0..1 │ │ out: overall_score 0..1 │
│ per-axis scores (cast, clothing, pose, scene, │ per-axis {score, ref, gen} over ~20 axes
composition, expression, color/lighting) (identity, body, wardrobe, action, affect,
diff_analysis (JSON: what's off + how to fix, camera, render) — target vs current values
phrased in Prompt-Builder axis vocabulary) (local model observes only; no fixes suggested)
└────────────────────┬──────────────────────────────────┘ └────────────────────┬──────────────────────────────────┘
│ score + diffs │ score + ref/gen per axis
┌────────────────────▼────────────────┐ ┌────────────────────▼────────────────┐
│ CALIBRATOR / controller │ │ CALIBRATOR / controller │
│ - accumulate per-axis scores │ │ - accumulate per-axis scores │
@@ -111,30 +111,30 @@ is sequential anyway. The 8B bf16 judge co-resides more easily.
## 3. Scoring rubric (what the VLM actually returns) ## 3. Scoring rubric (what the VLM actually returns)
The judge prompts Qwen3-VL to return **strict JSON** with one overall score and a score The judge prompts Qwen3-VL to return **strict JSON** with one overall score and, per axis,
per axis, where the axes mirror what PromptBuilder can control. This is what makes the the **target value (`ref`), the current value (`gen`), and the gap (`score`)** — exactly
diff *actionable* instead of generic prose. the *target / current / distance* an agent needs to calibrate. The local model only
observes; it suggests no fixes (a stronger external model owns correction).
```json ```json
{ {
"overall_score": 0.0, "overall_score": 0.0,
"axes": { "axes": {
"cast": {"score": 0.0, "diff": "ref has 1 woman, gen has 2"}, "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
"clothing": {"score": 0.0, "diff": "ref lingerie vs gen nude"}, "position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
"pose": {"score": 0.0, "diff": "ref standing vs gen seated"}, "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
"scene": {"score": 0.0, "diff": "ref bedroom vs gen outdoor"}, "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
"composition": {"score": 0.0, "diff": "ref full body vs gen close-up"}, "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
"expression": {"score": 0.0, "diff": "ref smiling vs gen neutral"}, "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
"color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"} }
},
"fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
} }
``` ```
The axis list is **configurable** on the node so it can match whichever PromptBuilder The axis list is **configurable** on the node. The default ~20 axes are grouped as
knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression, identity / body / wardrobe / action / affect / camera / render, kept granular so the
color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can *action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
map each one onto a knob. stays discriminative for explicit content. The agent steers each low axis's prompt wording
toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
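Because the agent depends on every axis carrying all three fields, it may be worth checking a reply before acting on it. The sketch below is an assumption about how such a check could look; the commit does not show the node's own parsing, and `validate_judgement` is a hypothetical helper.

```python
import json


def validate_judgement(raw: str, axes: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the reply looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]

    def bad_score(value) -> bool:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return True
        return not 0.0 <= value <= 1.0

    problems = []
    if bad_score(data.get("overall_score")):
        problems.append("overall_score missing or outside 0..1")
    got = data.get("axes")
    if not isinstance(got, dict):
        return problems + ["axes missing or not an object"]
    for ax in axes:
        entry = got.get(ax)
        if not isinstance(entry, dict):
            problems.append(f"{ax}: missing")
            continue
        if bad_score(entry.get("score")):
            problems.append(f"{ax}: score missing or outside 0..1")
        for key in ("ref", "gen"):
            if not isinstance(entry.get(key), str):
                problems.append(f"{ax}: {key} missing or not a string")
    return problems


good = '{"overall_score": 0.5, "axes": {"scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"}}}'
print(validate_judgement(good, ["scene"]))
print(validate_judgement(good, ["scene", "framing"]))
```

An agent could re-run the judge, or skip the affected axis, whenever the returned list is non-empty.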
### Reducing VLM-as-judge variance (important) ### Reducing VLM-as-judge variance (important)
@@ -162,10 +162,10 @@ LLM). So "calibration" = **searching the space of `(seed, profile, per-axis ov
to maximize `overall_score`. Three controller options, easiest → strongest: to maximize `overall_score`. Three controller options, easiest → strongest:
1. **Greedy per-axis hill-climb (start here).** 1. **Greedy per-axis hill-climb (start here).**
For each axis with the lowest score, apply the matching `fix_suggestion` as a knob Take the lowest-scoring axis, rewrite that axis's prompt wording toward its `ref`
override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change (target) value, regenerate, keep the change if `overall_score` improved, else revert.
if `overall_score` improved, else revert. Loop until ≥ target or no axis improves. Loop until ≥ target or no axis improves. The agent decides the wording (no machine
Implementable today with the PromptBuilder **ForLoop Start/End + Accumulator** nodes. fixes). Implementable with the PromptBuilder **ForLoop Start/End + Accumulator** nodes.
2. **Black-box optimizer over the knob vector.** 2. **Black-box optimizer over the knob vector.**
Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA-ES / Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA-ES /
+58 -24
@@ -41,7 +41,36 @@ RECOMMENDED_MODELS = {
"4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated", "4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated",
} }
DEFAULT_AXES = "cast, clothing, pose, scene, composition, expression, color_light" # Difference axes the judge scores. Granular by default so the comparison is
# discriminative for explicit/adult imagery (where coarse axes blur the differences
# that matter). Fully configurable on the node — trim or extend per use case.
# subject_count number of people
# gender_mix gender composition (e.g. 1F, 2F1M)
# body_type physique / build / proportions per subject
# distinctive_features tattoos / piercings / marks (identity anchors)
# age_appearance apparent age
# ethnicity_skin ethnicity / skin tone
# hair length, color, style
# clothing_state degree of undress + specific garments
# sexual_act the act / activity being performed
# position sexual position / arrangement of bodies
# penetration type & visibility of penetration
# explicitness how graphic / genital visibility level
# body_contact who contacts whom; interaction between subjects
# pose non-act body positioning
# facial_expression face / affect
# gaze eye contact / look direction
# framing shot type / crop (close-up <-> full body)
# camera_angle POV / angle / perspective
# scene location / setting / background
# lighting_color palette, lighting, color grade
# art_style photoreal vs anime/illustrated, render style
DEFAULT_AXES = (
"subject_count, gender_mix, body_type, distinctive_features, age_appearance, "
"ethnicity_skin, hair, clothing_state, sexual_act, position, penetration, "
"explicitness, body_contact, pose, facial_expression, gaze, framing, "
"camera_angle, scene, lighting_color, art_style"
)
# Cache loaded (model, processor) keyed by (path, precision) so the loop does not # Cache loaded (model, processor) keyed by (path, precision) so the loop does not
# reload weights every iteration. # reload weights every iteration.
@@ -196,27 +225,31 @@ def _ensure_chat_template(processor, model_path: str):
def _build_system_prompt(axes: list[str]) -> str: def _build_system_prompt(axes: list[str]) -> str:
axis_lines = "\n".join(f' "{a}": {{"score": <0..1>, "diff": "<short note>"}},' for a in axes) axis_lines = "\n".join(
f' "{a}": {{"score": <0..1>, "ref": "<what IMAGE 1 shows>", "gen": "<what IMAGE 2 shows>"}},'
for a in axes)
return ( return (
"You are a meticulous visual-similarity judge for an image-generation " "You are a meticulous visual-similarity judge for an image-generation "
"calibration loop. You are shown two images: IMAGE 1 is the REFERENCE " "calibration loop. You are shown two images: IMAGE 1 is the REFERENCE "
"(the target) and IMAGE 2 is the GENERATED candidate. Judge how closely " "(the target) and IMAGE 2 is the GENERATED candidate. Judge how closely "
"the GENERATED image reproduces the REFERENCE.\n\n" "the GENERATED image reproduces the REFERENCE.\n\n"
"Score each axis from 0 to 1 using this anchored rubric:\n" "For every axis report THREE things:\n"
" 0.0 = unrelated; 0.5 = same general category but clearly different " " - ref: concretely what IMAGE 1 (reference / target) shows for this axis\n"
"details; 1.0 = near-identical.\n" " - gen: concretely what IMAGE 2 (generated) shows for this axis\n"
"For each axis, FIRST note the concrete difference, THEN assign the number.\n\n" " - score: 0..1 closeness, where 0.0 = unrelated, 0.5 = same general "
"category but clearly different details, 1.0 = near-identical.\n"
"Use specific concrete values (e.g. ref 'doggy style', gen 'missionary'), "
"not vague notes. Describe ONLY what you observe — do NOT suggest fixes or "
"prompt changes; correction is handled by a separate model.\n\n"
"Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n" "Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n"
"{\n" "{\n"
' "overall_score": <0..1>,\n' ' "overall_score": <0..1>,\n'
' "axes": {\n' ' "axes": {\n'
f"{axis_lines}\n" f"{axis_lines}\n"
" },\n"
' "fix_suggestions": ["<actionable change to the generation prompt>", ...]\n'
" }\n" " }\n"
"Phrase every diff and fix in terms of the named axes " "}\n"
"(cast/clothing/pose/scene/composition/expression/color_light). " "overall_score must be consistent with the per-axis scores. If an axis is "
"overall_score must be consistent with the per-axis scores." "not applicable to either image, set score 1.0 and ref/gen to \"n/a\"."
) )
@@ -311,7 +344,7 @@ def _merge_swapped(a: dict, b: dict) -> dict:
return a return a
if not a: if not a:
return b return b
out = {"axes": {}, "fix_suggestions": []} out = {"axes": {}}
out["overall_score"] = round( out["overall_score"] = round(
(float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4 (float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4
) )
@@ -320,9 +353,11 @@ def _merge_swapped(a: dict, b: dict) -> dict:
sa = a.get("axes", {}).get(ax, {}) sa = a.get("axes", {}).get(ax, {})
sb = b.get("axes", {}).get(ax, {}) sb = b.get("axes", {}).get(ax, {})
score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0 score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0
diff = sa.get("diff") or sb.get("diff") or "" # In pass b the images were swapped, so b.ref describes the generated image
out["axes"][ax] = {"score": round(score, 4), "diff": diff} # and b.gen the reference -> invert b when falling back.
out["fix_suggestions"] = (a.get("fix_suggestions") or []) + (b.get("fix_suggestions") or []) ref = sa.get("ref") or sb.get("gen") or ""
gen = sa.get("gen") or sb.get("ref") or ""
out["axes"][ax] = {"score": round(score, 4), "ref": ref, "gen": gen}
return out return out
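A standalone sketch of the merge behaviour in this hunk, for readers who want to try the ref/gen inversion in isolation. It mirrors the new lines above (average the scores, prefer pass `a`, fall back to pass `b` with its fields inverted); the axis iteration order, the omission of the empty-pass early returns, and the sample values are assumptions, since the diff does not show them here.

```python
def merge_swapped(a: dict, b: dict) -> dict:
    """a = normal pass (IMAGE 1 is the reference); b = pass with the images swapped."""
    out = {"axes": {}}
    out["overall_score"] = round(
        (float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4
    )
    axes_a, axes_b = a.get("axes", {}), b.get("axes", {})
    # Assumed order: axes of pass a first, then any axis only pass b reported.
    names = list(axes_a) + [ax for ax in axes_b if ax not in axes_a]
    for ax in names:
        sa, sb = axes_a.get(ax, {}), axes_b.get(ax, {})
        score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0
        # In pass b the images were swapped: b's "ref" describes the generated
        # image and b's "gen" the reference, so invert b when falling back.
        ref = sa.get("ref") or sb.get("gen") or ""
        gen = sa.get("gen") or sb.get("ref") or ""
        out["axes"][ax] = {"score": round(score, 4), "ref": ref, "gen": gen}
    return out


pass_a = {"overall_score": 0.6,
          "axes": {"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"}}}
pass_b = {"overall_score": 0.5,
          "axes": {"framing": {"score": 0.4, "ref": "close-up", "gen": "full body"},
                   "scene": {"score": 0.8, "ref": "outdoor", "gen": "dim bedroom"}}}
merged = merge_swapped(pass_a, pass_b)
print(merged)
```

Note that an axis reported by only one pass is averaged against a default score of 0, as in the diff, so its merged score is halved.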
@@ -352,7 +387,6 @@ def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all,
"run_tag": run_tag, "run_tag": run_tag,
"overall_score": round(float(overall), 4), "overall_score": round(float(overall), 4),
"axes": (merged or {}).get("axes", {}), "axes": (merged or {}).get("axes", {}),
"fix_suggestions": (merged or {}).get("fix_suggestions", []),
"diff_analysis": diff_analysis, "diff_analysis": diff_analysis,
"prompt_used": prompt_used, "prompt_used": prompt_used,
"raw": raw_all, "raw": raw_all,
@@ -395,7 +429,7 @@ class QwenVLImageJudge:
"model_path": ("STRING", {"default": DEFAULT_MODEL_PATH}), "model_path": ("STRING", {"default": DEFAULT_MODEL_PATH}),
"precision": (["bf16", "fp16", "fp8", "nf4"], {"default": "bf16"}), "precision": (["bf16", "fp16", "fp8", "nf4"], {"default": "bf16"}),
"axes": ("STRING", {"default": DEFAULT_AXES, "multiline": True}), "axes": ("STRING", {"default": DEFAULT_AXES, "multiline": True}),
"max_new_tokens": ("INT", {"default": 512, "min": 64, "max": 4096}), "max_new_tokens": ("INT", {"default": 1024, "min": 64, "max": 4096}),
"temperature": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.5, "step": 0.05}), "temperature": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.5, "step": 0.05}),
"swap_eval": ("BOOLEAN", {"default": True}), "swap_eval": ("BOOLEAN", {"default": True}),
}, },
@@ -448,13 +482,13 @@ class QwenVLImageJudge:
overall = float(merged.get("overall_score", 0.0)) if merged else 0.0 overall = float(merged.get("overall_score", 0.0)) if merged else 0.0
axis_scores = json.dumps(merged.get("axes", {}), ensure_ascii=False, indent=2) if merged else "{}" axis_scores = json.dumps(merged.get("axes", {}), ensure_ascii=False, indent=2) if merged else "{}"
# Human/controller-readable diff summary. # Human/controller-readable diff summary, worst axes first (biggest gap).
diff_lines = [] items = sorted((merged.get("axes", {}) if merged else {}).items(),
for ax, info in (merged.get("axes", {}) if merged else {}).items(): key=lambda kv: float(kv[1].get("score", 0)))
diff_lines.append(f"- {ax}: {info.get('score', 0):.2f}{info.get('diff', '')}") diff_lines = [
fixes = merged.get("fix_suggestions", []) if merged else [] f"- {ax}: {info.get('score', 0):.2f} ref:[{info.get('ref', '')}] gen:[{info.get('gen', '')}]"
if fixes: for ax, info in items
diff_lines.append("fixes: " + "; ".join(str(f) for f in fixes)) ]
diff_analysis = "\n".join(diff_lines) if diff_lines else "(no parseable judgement)" diff_analysis = "\n".join(diff_lines) if diff_lines else "(no parseable judgement)"
report_path = _write_report( report_path = _write_report(