Switch compare to discrete verdicts + granular pose axes + per-axis definitions

The 4B's 0-1 scores were unreliable (identical ref/gen scored ~0.6), so the
judge now returns verdict match/partial/mismatch per axis; overall_score and a
new mismatch_count are computed from verdicts on our side (reliable, monotonic).
Expanded the action/pose cluster into position_name, body_orientation,
limb_arrangement, penetration, contact_points, genital_visibility (+ breast_size)
so explicit poses carry detail. Each axis now ships a one-line definition in the
prompt so gender_mix/subject_count stop absorbing positional text. 24 axes total.
Example workflows use the node default (axes=''). Docs realigned; stop condition
is now mismatch_count==0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 23:15:51 +02:00
parent c7ef756a71
commit 53f1f9b9b4
7 changed files with 165 additions and 117 deletions
+15 -12
@@ -118,23 +118,26 @@ observes; it suggests no fixes (a stronger external model owns correction).
```json
{
"overall_score": 0.0,
"axes": {
"subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
"position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
"clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
"scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
"lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
"subject_count": {"verdict": "match", "ref": "1 woman", "gen": "1 woman"},
"position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
"clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
"scene": {"verdict": "partial", "ref": "dim bedroom", "gen": "lit bedroom"},
"lighting_color": {"verdict": "match", "ref": "warm low-key", "gen": "warm low-key"}
}
}
```
The axis list is **configurable** on the node. The default ~20 axes are grouped as
identity / body / wardrobe / action / affect / camera / render, kept granular so the
*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
stays discriminative for explicit content. The agent steers each low axis's prompt wording
toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
A **discrete verdict** (match/partial/mismatch) is used instead of a 0-1 score: small VLMs
give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
buckets reliably. `overall_score` + `mismatch_count` are computed from the verdicts on our
side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
**configurable**; the default ~24 axes are grouped identity / body / wardrobe / action·pose
/ affect / camera / render, with the action·pose cluster split fine (`sexual_act`,
`position_name`, `body_orientation`, `limb_arrangement`, `penetration`, `contact_points`,
`genital_visibility`) so it stays discriminative for explicit content. Each axis carries a
one-line definition in the prompt. The agent steers each `mismatch`/`partial` axis toward
its `ref`. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
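
A minimal sketch of how `overall_score` and `mismatch_count` could be derived from the per-axis verdicts. The text above only says "mean ordinal", so the weights here (match = 1.0, partial = 0.5, mismatch = 0.0), the function name `summarize_verdicts`, and the choice to count only `mismatch` verdicts (not `partial`) in `mismatch_count` are all assumptions for illustration, not the node's actual code.

```python
from statistics import mean

# Assumed ordinal weights; the doc only states that the score is a "mean ordinal".
VERDICT_WEIGHT = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize_verdicts(axes):
    """Return (overall_score, mismatch_count) for a dict of axis -> {"verdict": ...}.

    Hypothetical helper: the name and signature are not taken from the repository.
    """
    verdicts = [entry["verdict"] for entry in axes.values()]
    if not verdicts:
        # No axes judged: nothing matched and nothing mismatched.
        return 0.0, 0

    unknown = sorted(set(verdicts) - set(VERDICT_WEIGHT))
    if unknown:
        # Fail loudly rather than silently scoring an unexpected label.
        raise ValueError(f"unexpected verdict(s): {unknown}")

    overall_score = mean(VERDICT_WEIGHT[v] for v in verdicts)
    # Assumption: only hard mismatches count toward the stop condition.
    mismatch_count = sum(1 for v in verdicts if v == "mismatch")
    return overall_score, mismatch_count


# The five axes from the example payload above.
example_axes = {
    "subject_count": {"verdict": "match"},
    "position_name": {"verdict": "mismatch"},
    "clothing_state": {"verdict": "mismatch"},
    "scene": {"verdict": "partial"},
    "lighting_color": {"verdict": "match"},
}
score, mismatches = summarize_verdicts(example_axes)
```

Because each verdict maps to a fixed weight, identical ref/gen pairs that are judged `match` on every axis always score 1.0 under this sketch, and turning any one verdict into a better bucket can never lower the score.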
### Reducing VLM-as-judge variance (important)