Switch compare to discrete verdicts + granular pose axes + per-axis definitions

The 4B's 0-1 scores were unreliable (identical ref/gen scored ~0.6), so the judge now returns verdict match/partial/mismatch per axis; overall_score and a new mismatch_count are computed from verdicts on our side (reliable, monotonic).

Expanded the action/pose cluster into position_name, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility (+ breast_size) so explicit poses carry detail. Each axis now ships a one-line definition in the prompt so gender_mix/subject_count stop absorbing positional text. 24 axes total.

Example workflows use the node default (axes=''). Docs realigned; stop condition is now mismatch_count==0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 23:15:51 +02:00
parent c7ef756a71
commit 53f1f9b9b4
7 changed files with 165 additions and 117 deletions
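
The aggregation described in the commit message (per-axis verdicts in, `overall_score` and `mismatch_count` out) might look roughly like the sketch below. It is not the code from this commit. The function name `summarize_verdicts`, the ordinal weights (match=1.0, partial=0.5, mismatch=0.0), and the choice to skip unrecognized verdicts are all assumptions; the source only says the score is a "mean ordinal" and that the stop condition is `mismatch_count==0`.

```python
from typing import Dict, Mapping, Tuple

# Assumed ordinal weights; the commit only states "mean ordinal".
VERDICT_WEIGHT: Dict[str, float] = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize_verdicts(axes: Mapping[str, Mapping[str, str]]) -> Tuple[float, int]:
    """Return (overall_score, mismatch_count) from per-axis verdicts.

    Hypothetical helper: axes with a missing or unrecognized verdict are
    skipped rather than guessed at, which is an assumption of this sketch.
    """
    verdicts = [str(axis.get("verdict", "")).strip().lower() for axis in axes.values()]
    known = [v for v in verdicts if v in VERDICT_WEIGHT]

    # Only hard mismatches are counted; "partial" does not block the stop condition.
    mismatch_count = sum(1 for v in known if v == "mismatch")

    # Mean of the ordinal weights; defined as 0.0 when nothing usable came back.
    overall_score = sum(VERDICT_WEIGHT[v] for v in known) / len(known) if known else 0.0
    return overall_score, mismatch_count


if __name__ == "__main__":
    # Same verdicts as the JSON example in the diff below.
    example = {
        "subject_count": {"verdict": "match"},
        "position_name": {"verdict": "mismatch"},
        "clothing_state": {"verdict": "mismatch"},
        "scene": {"verdict": "partial"},
        "lighting_color": {"verdict": "match"},
    }
    score, mismatches = summarize_verdicts(example)
    print(score, mismatches)  # 0.5 2
    print("stop" if mismatches == 0 else "keep iterating")
```

Because both numbers are derived from three discrete buckets rather than from model-emitted floats, identical ref/gen pairs can only land at 1.0 when every axis is judged `match`, which is what makes the result usable as a stop signal.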
@@ -118,23 +118,26 @@ observes; it suggests no fixes (a stronger external model owns correction).

 ```json
 {
-  "overall_score": 0.0,
  "axes": {
-    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
-    "position":      {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
-    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
-    "scene":         {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
-    "framing":       {"score": 0.6, "ref": "full body", "gen": "close-up"},
-    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+    "subject_count":  {"verdict": "match",    "ref": "1 woman", "gen": "1 woman"},
+    "position_name":  {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
+    "clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
+    "scene":          {"verdict": "partial",  "ref": "dim bedroom", "gen": "lit bedroom"},
+    "lighting_color": {"verdict": "match",    "ref": "warm low-key", "gen": "warm low-key"}
  }
 }
 ```

-The axis list is **configurable** on the node. The default ~20 axes are grouped as
-identity / body / wardrobe / action / affect / camera / render, kept granular so the
-*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
-stays discriminative for explicit content. The agent steers each low axis's prompt wording
-toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
+A **discrete verdict** (match/partial/mismatch) is used instead of a 0–1 score: small VLMs
+give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
+buckets reliably. `overall_score` + `mismatch_count` are computed from the verdicts on our
+side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
+**configurable**; the default ~24 axes are grouped identity / body / wardrobe / action·pose
+/ affect / camera / render, with the action·pose cluster split fine (`sexual_act`,
+`position_name`, `body_orientation`, `limb_arrangement`, `penetration`, `contact_points`,
+`genital_visibility`) so it stays discriminative for explicit content. Each axis carries a
+one-line definition in the prompt. The agent steers each `mismatch`/`partial` axis toward
+its `ref`. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).

 ### Reducing VLM‑as‑judge variance (important)