Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger
external agent. Each axis reports the target value (ref), the current value (gen)
and the closeness (score) — the target/current/distance an agent needs to
calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/
camera/render) so the action cluster stays discriminative for explicit content.
swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first;
default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:52:40 +02:00
parent aa3983d94a
commit 959ec70065
6 changed files with 188 additions and 164 deletions
+25 -25
@@ -28,12 +28,12 @@
│ Qwen3-VL JUDGE node ── the "vllm node" │
│ in : reference + generated │
│ out: overall_score 0..1 │
│ per-axis scores (cast, clothing, pose, scene,
composition, expression, color/lighting)
diff_analysis (JSON: what's off + how to fix,
phrased in Prompt-Builder axis vocabulary)
│ per-axis {score, ref, gen} over ~20 axes
(identity, body, wardrobe, action, affect,
camera, render) — target vs current values
(local model observes only; no fixes suggested)
└────────────────────┬──────────────────────────────────┘
│ score + diffs
│ score + ref/gen per axis
┌────────────────────▼────────────────┐
│ CALIBRATOR / controller │
│ - accumulate per-axis scores │
@@ -111,30 +111,30 @@ is sequential anyway. The 8B bf16 judge co-resides more easily.
## 3. Scoring rubric (what the VLM actually returns)
The judge prompts Qwen3-VL to return **strict JSON** with one overall score and a score
per axis, where the axes mirror what Prompt-Builder can control. This is what makes the
diff *actionable* instead of generic prose.
The judge prompts Qwen3-VL to return **strict JSON** with one overall score and, per axis,
the **target value (`ref`), the current value (`gen`), and the closeness (`score`)**: exactly
the *target / current / distance* an agent needs to calibrate. The local model only
observes; it suggests no fixes (a stronger external model owns correction).
```json
{
"overall_score": 0.0,
"axes": {
"cast": {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
"clothing": {"score": 0.0, "diff": "ref lingerie vs gen nude"},
"pose": {"score": 0.0, "diff": "ref standing vs gen seated"},
"scene": {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
"composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
"expression": {"score": 0.0, "diff": "ref smiling vs gen neutral"},
"color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
},
"fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
"subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
"position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
"clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
"scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
"lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
}
}
```
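A consumer of this output would probably want to validate it before acting on it. The following is a minimal sketch, assuming the judge's reply arrives as a raw string shaped like the `{score, ref, gen}` example; `AxisResult` and `parse_judge_output` are hypothetical names for illustration, not part of the node.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisResult:
    """One axis of the judge output (hypothetical container, not from the node)."""
    score: float  # closeness of gen to ref, 0..1
    ref: str      # target value observed in the reference image
    gen: str      # current value observed in the generated image


def parse_judge_output(raw: str) -> tuple[float, dict[str, AxisResult]]:
    """Parse the strict-JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    overall = float(data["overall_score"])
    if not 0.0 <= overall <= 1.0:
        raise ValueError(f"overall_score out of range: {overall}")

    axes: dict[str, AxisResult] = {}
    for name, entry in data["axes"].items():
        missing = {"score", "ref", "gen"} - set(entry)
        if missing:
            raise ValueError(f"axis {name!r} is missing {sorted(missing)}")
        score = float(entry["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"axis {name!r} score out of range: {score}")
        axes[name] = AxisResult(score, str(entry["ref"]), str(entry["gen"]))
    return overall, axes
```

Failing loudly on a missing field or an out-of-range score keeps a malformed reply from silently steering the calibration loop.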
The axis list is **configurable** on the node so it can match whichever Prompt-Builder
knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
map each one onto a knob.
The axis list is **configurable** on the node. The default ~20 axes are grouped as
identity / body / wardrobe / action / affect / camera / render, kept granular so the
*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
stays discriminative for explicit content. The agent steers each low axis's prompt wording
toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
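One way an agent might decide which axes to steer first is to rank them worst-first. This is a minimal sketch, assuming the parsed judge output is a plain dict shaped like the JSON example; `worst_first` and its `threshold` parameter are hypothetical and not part of the node.

```python
def worst_first(axes: dict[str, dict], threshold: float = 0.8) -> list[tuple[str, float, str, str]]:
    """Return (axis, score, ref, gen) for axes below `threshold`, lowest score first."""
    low = [
        (name, float(a["score"]), a["ref"], a["gen"])
        for name, a in axes.items()
        if float(a["score"]) < threshold
    ]
    # sorted() is stable, so axes with equal scores keep their original order.
    return sorted(low, key=lambda row: row[1])


axes = {
    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
    "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
    "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
}
for name, score, ref, gen in worst_first(axes):
    print(f"{name}: {score:.2f}  target={ref!r}  current={gen!r}")
```

Axes that already match (here `subject_count`) drop out, so the agent only sees the target/current pairs that still need new wording.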
### Reducing VLM-as-judge variance (important)
@@ -162,10 +162,10 @@ LLM). So "calibration" = **searching the space of `(seed, profile, per-axis ov
to maximize `overall_score`. Three controller options, easiest → strongest:
1. **Greedy per-axis hill-climb (start here).**
For each axis with the lowest score, apply the matching `fix_suggestion` as a knob
override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change
if `overall_score` improved, else revert. Loop until ≥ target or no axis improves.
Implementable today with the Prompt-Builder **ForLoop Start/End + Accumulator** nodes.
Take the lowest-scoring axis, rewrite that axis's prompt wording toward its `ref`
(target) value, regenerate, keep the change if `overall_score` improved, else revert.
Loop until ≥ target or no axis improves. The agent decides the wording (the judge
suggests no fixes). Implementable with the Prompt-Builder **ForLoop Start/End + Accumulator** nodes.
2. **Black-box optimizer over the knob vector.**
Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA-ES /