Switch compare to discrete verdicts + granular pose axes + per-axis definitions
The 4B model's 0-1 scores were unreliable (identical ref/gen scored ~0.6), so the judge now returns a verdict (match/partial/mismatch) per axis; overall_score and a new mismatch_count are computed from the verdicts on our side (reliable, monotonic).

Expanded the action/pose cluster into position_name, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility (+ breast_size) so explicit poses carry detail. Each axis now ships a one-line definition in the prompt so gender_mix/subject_count stop absorbing positional text. 24 axes total.

Example workflows use the node default (axes=''). Docs realigned; stop condition is now mismatch_count==0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+15 -12
@@ -118,23 +118,26 @@ observes; it suggests no fixes (a stronger external model owns correction).

 ```json
 {
   "overall_score": 0.0,
   "axes": {
-    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
-    "position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
-    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
-    "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
-    "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
-    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+    "subject_count": {"verdict": "match", "ref": "1 woman", "gen": "1 woman"},
+    "position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
+    "clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
+    "scene": {"verdict": "partial", "ref": "dim bedroom", "gen": "lit bedroom"},
+    "lighting_color": {"verdict": "match", "ref": "warm low-key", "gen": "warm low-key"}
   }
 }
 ```

-The axis list is **configurable** on the node. The default ~20 axes are grouped as
-identity / body / wardrobe / action / affect / camera / render, kept granular so the
-*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
-stays discriminative for explicit content. The agent steers each low axis's prompt wording
-toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
+A **discrete verdict** (match/partial/mismatch) is used instead of a 0–1 score: small VLMs
+give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
+buckets reliably. `overall_score` + `mismatch_count` are computed from the verdicts on our
+side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
+**configurable**; the default ~24 axes are grouped identity / body / wardrobe / action·pose
+/ affect / camera / render, with the action·pose cluster split fine (`sexual_act`,
+`position_name`, `body_orientation`, `limb_arrangement`, `penetration`, `contact_points`,
+`genital_visibility`) so it stays discriminative for explicit content. Each axis carries a
+one-line definition in the prompt. The agent steers each `mismatch`/`partial` axis toward
+its `ref`. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).

 ### Reducing VLM‑as‑judge variance (important)

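The verdict aggregation this hunk documents ("mean ordinal" plus a mismatch count) can be sketched roughly as below. This is an illustration, not the node's actual code: the function name `summarize_verdicts`, the ordinal values (match=1.0, partial=0.5, mismatch=0.0), and the choice to count only `mismatch` verdicts (not `partial`) toward `mismatch_count` are all assumptions, since the commit states only that both values are derived from the verdicts.

```python
from typing import Dict, Tuple

# Assumed ordinal mapping; the source says only "mean ordinal".
VERDICT_ORDINAL = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize_verdicts(axes: Dict[str, Dict[str, str]]) -> Tuple[float, int]:
    """Return (overall_score, mismatch_count) for a judge's per-axis verdicts.

    `axes` maps an axis name to a dict carrying at least a "verdict" key,
    shaped like the "axes" object in the JSON example.
    """
    if not axes:
        raise ValueError("no axes to summarize")

    ordinals = []
    mismatch_count = 0
    for name, entry in axes.items():
        verdict = entry.get("verdict")
        if verdict not in VERDICT_ORDINAL:
            # Reject anything outside the three buckets instead of guessing.
            raise ValueError(f"axis {name!r} has unknown verdict {verdict!r}")
        ordinals.append(VERDICT_ORDINAL[verdict])
        if verdict == "mismatch":  # assumption: "partial" is not counted
            mismatch_count += 1

    overall_score = sum(ordinals) / len(ordinals)
    return overall_score, mismatch_count


# The five axes from the JSON example in the hunk.
example_axes = {
    "subject_count": {"verdict": "match"},
    "position_name": {"verdict": "mismatch"},
    "clothing_state": {"verdict": "mismatch"},
    "scene": {"verdict": "partial"},
    "lighting_color": {"verdict": "match"},
}

score, mismatches = summarize_verdicts(example_axes)
should_stop = mismatches == 0  # stop condition named in the commit message
```

Under these assumed ordinals the example yields a score of 0.5 with two mismatches, so the loop would continue. Because the score is a plain mean over fixed ordinals, identical ref/gen pairs that the judge marks `match` on every axis always produce 1.0, which is the monotonic behaviour the commit message is after.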