Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score): the target/current/distance an agent needs to calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/camera/render) so the action cluster stays discriminative for explicit content. swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:52:40 +02:00
parent aa3983d94a
commit 959ec70065
6 changed files with 188 additions and 164 deletions
@@ -28,12 +28,12 @@
   │ Qwen3-VL JUDGE node  ── the "vllm node"                │
   │  in : reference + generated                            │
   │  out: overall_score 0..1                               │
-   │       per-axis scores  (cast, clothing, pose, scene,   │
-   │         composition, expression, color/lighting)       │
-   │       diff_analysis (JSON: what's off + how to fix,    │
-   │         phrased in Prompt-Builder axis vocabulary)     │
+   │       per-axis {score, ref, gen} over ~20 axes         │
+   │         (identity, body, wardrobe, action, affect,     │
+   │          camera, render) — target vs current values    │
+   │       (local model observes only; no fixes suggested)  │
   └────────────────────┬──────────────────────────────────┘
-                        │ score + diffs
+                        │ score + ref/gen per axis
   ┌────────────────────▼────────────────┐
   │ CALIBRATOR / controller             │
   │  - accumulate per-axis scores        │
@@ -111,30 +111,30 @@ is sequential anyway. The 8B bf16 judge co‑resides more easily.

 ## 3. Scoring rubric (what the VLM actually returns)

-The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and a score
-per axis, where the axes mirror what Prompt‑Builder can control. This is what makes the
-diff *actionable* instead of generic prose.
+The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and, per axis,
+the **target value (`ref`), the current value (`gen`), and the closeness (`score`, 1.0 = match)**:
+the *target / current / distance* an agent needs to calibrate. The local model only
+observes; it suggests no fixes (a stronger external model owns correction).

 ```json
 {
  "overall_score": 0.0,
  "axes": {
-    "cast":        {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
-    "clothing":    {"score": 0.0, "diff": "ref lingerie vs gen nude"},
-    "pose":        {"score": 0.0, "diff": "ref standing vs gen seated"},
-    "scene":       {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
-    "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
-    "expression":  {"score": 0.0, "diff": "ref smiling vs gen neutral"},
-    "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
-  },
-  "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
+    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
+    "position":      {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
+    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
+    "scene":         {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
+    "framing":       {"score": 0.6, "ref": "full body", "gen": "close-up"},
+    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+  }
 }
 ```
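
A minimal sketch of how a consumer might validate this shape before acting on it. The function name `parse_judge_output` is hypothetical, and the 0..1 range check on per-axis scores is an assumption carried over from `overall_score`; neither is taken from the node's actual code.

```python
import json

REQUIRED_KEYS = {"score", "ref", "gen"}


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check the per-axis {score, ref, gen} shape.

    Hypothetical helper: raises ValueError on anything that does not match.
    """
    data = json.loads(raw)

    overall = float(data["overall_score"])
    if not 0.0 <= overall <= 1.0:
        raise ValueError(f"overall_score out of range: {overall}")

    axes = {}
    for name, entry in data["axes"].items():
        if not isinstance(entry, dict):
            raise ValueError(f"axis {name!r} is not an object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"axis {name!r} is missing {sorted(missing)}")
        score = float(entry["score"])
        # Assumption: per-axis scores share the 0..1 range of overall_score.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"axis {name!r} score out of range: {score}")
        axes[name] = {"score": score, "ref": str(entry["ref"]), "gen": str(entry["gen"])}

    return {"overall_score": overall, "axes": axes}


sample = """
{
  "overall_score": 0.65,
  "axes": {
    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
    "scene":         {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"}
  }
}
"""
parsed = parse_judge_output(sample)
```

Rejecting malformed replies early keeps a single bad generation from the VLM out of the accumulated per-axis scores.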

-The axis list is **configurable** on the node so it can match whichever Prompt‑Builder
-knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
-color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
-map each one onto a knob.
+The axis list is **configurable** on the node. The default ~20 axes are grouped as
+identity / body / wardrobe / action / affect / camera / render, kept granular so the
+*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
+stays discriminative for explicit content. The agent steers each low axis's prompt wording
+toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
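
As an illustration of the steering step, the sketch below picks the axes that fall under a cutoff and lists them worst-first with their current and target values. `AxisTarget`, `low_axis_targets` and the 0.7 threshold are hypothetical names and values chosen for the example, not part of the node or of CALIBRATION_POLICY.md.

```python
from typing import NamedTuple


class AxisTarget(NamedTuple):
    axis: str
    score: float
    current: str  # the judge's `gen` value
    target: str   # the judge's `ref` value


def low_axis_targets(axes: dict, threshold: float = 0.7) -> list[AxisTarget]:
    """Return axes scoring below `threshold`, lowest score first.

    The threshold is an assumed cutoff for illustration only.
    """
    picked = [
        AxisTarget(name, entry["score"], entry["gen"], entry["ref"])
        for name, entry in axes.items()
        if entry["score"] < threshold
    ]
    # sorted() is stable, so axes with equal scores keep their original order.
    return sorted(picked, key=lambda t: t.score)


axes = {
    "subject_count":  {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
    "scene":          {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
    "framing":        {"score": 0.6, "ref": "full body", "gen": "close-up"},
    "lighting_color": {"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"},
}

for t in low_axis_targets(axes):
    print(f"{t.axis}: {t.current!r} -> {t.target!r} (score {t.score})")
```

The output of such a helper is only a worklist; choosing the actual prompt wording stays with the external agent.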

 ### Reducing VLM‑as‑judge variance (important)

@@ -162,10 +162,10 @@ LLM). So "calibration" = **searching the space of `(seed, profile, per‑axis ov
 to maximize `overall_score`. Three controller options, easiest → strongest:

 1. **Greedy per‑axis hill‑climb (start here).**
-   For each axis with the lowest score, apply the matching `fix_suggestion` as a knob
-   override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change
-   if `overall_score` improved, else revert. Loop until ≥ target or no axis improves.
-   Implementable today with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.
+   Take the lowest‑scoring axis, rewrite that axis's prompt wording toward its `ref`
+   (target) value, regenerate, keep the change if `overall_score` improved, else revert.
+   Loop until `overall_score` ≥ target or no axis improves. The agent chooses the wording
+   (the judge suggests none). Implementable with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.

 2. **Black‑box optimizer over the knob vector.**
   Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES /