Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions
The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score): the target/current/distance an agent needs to calibrate.

Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/camera/render) so the action cluster stays discriminative for explicit content. swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
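As a rough illustration of the swap handling described in the message, the sketch below restores `ref`/`gen` for a pass that saw the two images in reversed order, then combines it with the forward pass. The function names `unswap` and `merge_passes` are hypothetical, and averaging the two scores is an assumption; the commit only states that `swap_eval` inverts `ref`/`gen` of the swapped pass.

```python
from statistics import mean


def unswap(axes):
    """Restore ref/gen for a pass that was judged with the images swapped."""
    return {
        name: {"score": axis["score"], "ref": axis["gen"], "gen": axis["ref"]}
        for name, axis in axes.items()
    }


def merge_passes(forward, swapped):
    """Combine a forward pass with a swapped pass (hypothetical helper).

    Scores are averaged per axis (an assumption, not confirmed by the commit);
    ref/gen text is taken from the forward pass.
    """
    restored = unswap(swapped)
    merged = {}
    for name, fwd in forward.items():
        scores = [fwd["score"]]
        if name in restored:
            scores.append(restored[name]["score"])
        merged[name] = {"score": mean(scores), "ref": fwd["ref"], "gen": fwd["gen"]}
    return merged
```

Judging the pair in both orders and combining the results is a common way to reduce position bias in a VLM judge; inverting `ref`/`gen` first keeps the two passes comparable axis by axis.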
25 insertions(+), 25 deletions(-)
@@ -28,12 +28,12 @@
 │  Qwen3-VL JUDGE node ── the "vllm node"               │
 │  in : reference + generated                           │
 │  out: overall_score 0..1                              │
-│       per-axis scores (cast, clothing, pose, scene,   │
-│       composition, expression, color/lighting)        │
-│       diff_analysis (JSON: what's off + how to fix,   │
-│       phrased in Prompt-Builder axis vocabulary)      │
+│       per-axis {score, ref, gen} over ~20 axes        │
+│       (identity, body, wardrobe, action, affect,      │
+│       camera, render) — target vs current values      │
+│       (local model observes only; no fixes suggested) │
 └────────────────────┬──────────────────────────────────┘
-                     │ score + diffs
+                     │ score + ref/gen per axis
 ┌────────────────────▼────────────────┐
 │  CALIBRATOR / controller            │
 │  - accumulate per-axis scores       │
@@ -111,30 +111,30 @@ is sequential anyway. The 8B bf16 judge co‑resides more easily.

 ## 3. Scoring rubric (what the VLM actually returns)

-The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and a score
-per axis, where the axes mirror what Prompt‑Builder can control. This is what makes the
-diff *actionable* instead of generic prose.
+The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and, per axis,
+the **target value (`ref`), the current value (`gen`), and the gap (`score`)** — exactly
+the *target / current / distance* an agent needs to calibrate. The local model only
+observes; it suggests no fixes (a stronger external model owns correction).

 ```json
 {
   "overall_score": 0.0,
   "axes": {
-    "cast":        {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
-    "clothing":    {"score": 0.0, "diff": "ref lingerie vs gen nude"},
-    "pose":        {"score": 0.0, "diff": "ref standing vs gen seated"},
-    "scene":       {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
-    "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
-    "expression":  {"score": 0.0, "diff": "ref smiling vs gen neutral"},
-    "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
-  },
-  "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
+    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
+    "position":      {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
+    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
+    "scene":         {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
+    "framing":       {"score": 0.6, "ref": "full body", "gen": "close-up"},
+    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+  }
 }
 ```

-The axis list is **configurable** on the node so it can match whichever Prompt‑Builder
-knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
-color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
-map each one onto a knob.
+The axis list is **configurable** on the node. The default ~20 axes are grouped as
+identity / body / wardrobe / action / affect / camera / render, kept granular so the
+*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
+stays discriminative for explicit content. The agent steers each low axis's prompt wording
+toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).

 ### Reducing VLM‑as‑judge variance (important)

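A consumer of the new judge output might parse and rank it along these lines. This is a minimal sketch, not the node's actual code: `parse_judge_reply` and `worst_first` are hypothetical names, `score` is treated as closeness in 0..1 (1.0 meaning `gen` matches `ref`, as in the `subject_count` example), and skipping malformed axes is an assumed policy.

```python
import json

REQUIRED_KEYS = ("score", "ref", "gen")


def parse_judge_reply(text):
    """Parse the judge's strict-JSON reply into a normalized report dict."""
    data = json.loads(text)
    axes = {}
    for name, axis in data.get("axes", {}).items():
        # Assumed policy: silently skip axes that lack score/ref/gen.
        if not isinstance(axis, dict) or any(k not in axis for k in REQUIRED_KEYS):
            continue
        score = min(1.0, max(0.0, float(axis["score"])))  # clamp to 0..1
        axes[name] = {"score": score, "ref": str(axis["ref"]), "gen": str(axis["gen"])}
    return {"overall_score": float(data.get("overall_score", 0.0)), "axes": axes}


def worst_first(report):
    """Return (axis_name, axis) pairs ordered from lowest to highest score."""
    return sorted(report["axes"].items(), key=lambda item: item[1]["score"])
```

Ordering the axes worst-first matches the diff summary behavior mentioned in the commit message, so the agent sees the largest gaps before the axes that already match.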
@@ -162,10 +162,10 @@ LLM). So "calibration" = **searching the space of `(seed, profile, per‑axis ov
 to maximize `overall_score`. Three controller options, easiest → strongest:

 1. **Greedy per‑axis hill‑climb (start here).**
-   For each axis with the lowest score, apply the matching `fix_suggestion` as a knob
-   override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change
-   if `overall_score` improved, else revert. Loop until ≥ target or no axis improves.
-   Implementable today with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.
+   Take the lowest‑scoring axis, rewrite that axis's prompt wording toward its `ref`
+   (target) value, regenerate, keep the change if `overall_score` improved, else revert.
+   Loop until ≥ target or no axis improves. The agent decides the wording (no machine
+   fixes). Implementable with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.

 2. **Black‑box optimizer over the knob vector.**
    Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES /
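The greedy hill-climb in option 1 can be sketched roughly as follows. Everything here is illustrative: `judge` stands in for "regenerate and score" and `rewrite` for the agent's wording decision, both passed in as hypothetical callables, and keeping one prompt fragment per axis in a dict is an assumed data layout, not the Prompt-Builder's.

```python
def hill_climb(prompt_parts, judge, rewrite, target=0.9, max_rounds=10):
    """Greedy per-axis hill-climb (a sketch under the assumptions above).

    prompt_parts: dict mapping axis name -> current prompt wording (assumed layout)
    judge(parts): returns {"overall_score": float, "axes": {name: {score, ref, gen}}}
    rewrite(axis_name, ref_value, current_wording): returns the new wording
    """
    best = dict(prompt_parts)
    report = judge(best)
    for _ in range(max_rounds):
        if report["overall_score"] >= target:
            break
        improved = False
        # Try axes from lowest score upward; accept the first change that helps.
        ranked = sorted(report["axes"].items(), key=lambda item: item[1]["score"])
        for name, axis in ranked:
            trial = dict(best)
            trial[name] = rewrite(name, axis["ref"], best.get(name, ""))
            trial_report = judge(trial)
            if trial_report["overall_score"] > report["overall_score"]:
                best, report, improved = trial, trial_report, True
                break
            # Otherwise the trial is dropped, which reverts the change.
        if not improved:
            break  # no axis improves the overall score
    return best, report
```

Each round costs at least one generation plus one judge call, so `max_rounds` acts as a budget cap on top of the two stopping conditions named in the text.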