Switch compare to discrete verdicts + granular pose axes + per-axis definitions

The 4B's 0-1 scores were unreliable (identical ref/gen scored ~0.6), so the judge now returns a verdict (match/partial/mismatch) per axis; overall_score and a new mismatch_count are computed from the verdicts on our side (reliable, monotonic).

Expanded the action/pose cluster into position_name, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility (+ breast_size) so explicit poses carry detail. Each axis now ships a one-line definition in the prompt so gender_mix/subject_count stop absorbing positional text. 24 axes total.

Example workflows use the node default (axes=''). Docs realigned; the stop condition is now mismatch_count==0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 23:15:51 +02:00
parent c7ef756a71
commit 53f1f9b9b4
7 changed files with 165 additions and 117 deletions
@@ -59,9 +59,10 @@ Stdout (captured by the agent) is the report:
 {
  "run_tag": "iter003",
  "overall_score": 0.62,
+  "mismatch_count": 1,
  "axes": {
-    "position":       {"score": 0.40, "ref": "doggy style", "gen": "missionary"},
-    "clothing_state": {"score": 0.85, "ref": "red lace lingerie", "gen": "plain bra"}
+    "position_name":  {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
+    "clothing_state": {"verdict": "partial",  "ref": "red lace lingerie", "gen": "plain bra"}
  },
  "prompt_used": "...",
  "_prompt_id": "…", "_report_path": "…/calib_iter003.json"
@@ -14,11 +14,14 @@ the agent needs three things:
 |---|---|---|
 | `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
 | `gen` | what the **generated** image shows | the **current** state — what to change |
-| `score` | 0–1 closeness | the **gap / priority** — which axes to fix first |
+| `verdict` | `match` / `partial` / `mismatch` | which axes to fix first (mismatch → partial → match) |

 That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
-prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
-(`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
+prompt so `gen → ref` on the **mismatch** (then `partial`) axes. The judge returns
+`{"verdict", "ref", "gen"}` per axis. A discrete verdict is used because small VLMs give
+**unreliable 0–1 scores** (identical ref/gen often scored ~0.6) but classify match/partial/
+mismatch reliably. `overall_score` and `mismatch_count` are computed **from the verdicts on
+our side** (the mean of the verdicts' ordinal values and the number of `mismatch` verdicts,
+respectively), so they're monotonic and trustworthy as a stop signal.
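
As a rough illustration of how the two aggregate fields could be derived from the per-axis verdicts, here is a minimal sketch. The numeric mapping (match = 1.0, partial = 0.5, mismatch = 0.0) is an assumption, since the text only says "mean ordinal", and `summarize` is a hypothetical helper name rather than the node's actual function.

```python
from typing import Dict, Tuple

# Assumed ordinal values; the docs only say "mean ordinal".
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize(axes: Dict[str, dict]) -> Tuple[float, int]:
    """Return (overall_score, mismatch_count) for a report's `axes` mapping."""
    if not axes:
        return 0.0, 0
    values = [VERDICT_VALUE[entry["verdict"]] for entry in axes.values()]
    mismatch_count = sum(1 for entry in axes.values() if entry["verdict"] == "mismatch")
    return round(sum(values) / len(values), 2), mismatch_count


axes = {
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"},
    "lighting_color": {"verdict": "partial", "ref": "warm low-key", "gen": "warm daylight"},
    "framing": {"verdict": "mismatch", "ref": "full body", "gen": "close-up"},
}
print(summarize(axes))  # (0.5, 1)
```

Because both numbers depend only on the discrete verdicts, the same set of verdicts always yields the same aggregates, whatever mapping is actually used.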

 The axes must **span what the prompt can express** — you can only fix what the prompt can
 say, and each diff must map to a lever. The default set (configurable on the node) is
@@ -27,16 +30,19 @@ grouped below.
 ## Axes (default set — edit `axes` on the node to taste)

 - **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
+- **Body:** `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
 - **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action (where explicit content concentrates):** `sexual_act`, `position`,
-  `penetration`, `explicitness`, `body_contact`
- **Affect:** `pose`, `facial_expression`, `gaze`
+- **Action / pose (where explicit content concentrates — kept granular):** `sexual_act`,
+  `position_name` (doggy/cowgirl/…), `body_orientation` (on top/from behind/…),
+  `limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle),
+  `contact_points`, `genital_visibility`, `pose` (torso/head lean)
+- **Affect:** `facial_expression`, `gaze`
 - **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
 - **Render:** `scene`, `lighting_color`, `art_style`

-Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
-interaction cluster granular so the agent gets actionable targets.
+Each axis carries a one-line definition in the prompt (so e.g. `gender_mix` is a *count*,
+not a position). Coarse axes blur the differences that matter for adult imagery; the act /
+pose cluster is split into many axes so the agent gets specific, actionable targets.
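
One way the per-axis definitions might be rendered into the judge prompt is sketched below. The definition wording and the names `AXIS_DEFINITIONS` and `render_axis_block` are hypothetical; only the axis names and the idea of one definition line per axis come from the text above.

```python
from typing import Iterable

# Hypothetical one-line definitions; the real wording lives in the node's prompt.
AXIS_DEFINITIONS = {
    "subject_count": "number of people visible",
    "gender_mix": "how many subjects of each gender (a count, not a position)",
    "framing": "shot type and crop",
    "camera_angle": "point of view and camera angle",
}


def render_axis_block(axes: Iterable[str]) -> str:
    """Render one '- name: definition' line per requested axis."""
    return "\n".join(f"- {name}: {AXIS_DEFINITIONS[name]}" for name in axes)


print(render_axis_block(["gender_mix", "framing"]))
```

Keeping the definition next to the axis name is what stops a small model from writing positional text into a counting axis.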

 ## Step 0 — first pass (describe / bootstrap)

@@ -57,21 +63,22 @@ written by hand — the VL provides the target to reproduce.
 ## Per-iteration algorithm (greedy per-axis hill-climb)

 ```
-best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
+best = -1 ; best_state = initial_state ; stale = 0 ; i = 0
 loop:
  i += 1
  prompt = render(state)                       # state = current value per axis
  report = run agent_bridge.py --prompt prompt --negative state.negative
                               --seed state.seed --run-tag iter{i}
                               --workflow wf.json --analysis-dir <report_dir>
-  if report.overall_score >= TARGET: stop("converged", state)         # e.g. 0.85
-  if report.overall_score > best_score:
-      best_score = report.overall_score ; best_state = state ; stale = 0
+  if report.mismatch_count == 0 and report.overall_score >= TARGET:
+      stop("converged", state)                 # TARGET e.g. 0.9 (mostly match)
+  if report.overall_score > best:
+      best = report.overall_score ; best_state = state ; stale = 0
  else:
      stale += 1 ; state = best_state          # revert the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)

-  worst = axis with the lowest report.axes[*].score
+  worst = a `mismatch` axis (else a `partial` axis) from report.axes
  target_value = report.axes[worst].ref         # what the reference shows
  state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
 ```
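
The pseudocode above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's implementation: `evaluate` stands in for "render the prompt, run `agent_bridge.py`, parse the printed JSON", and `edit_toward` stands in for the agent's own rewording step; both are hypothetical callables supplied by the caller.

```python
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, str]  # axis name -> current prompt wording


def pick_axis(axes: Dict[str, dict]) -> Optional[str]:
    """Return a `mismatch` axis if any, else a `partial` axis, else None."""
    for wanted in ("mismatch", "partial"):
        for name, entry in axes.items():
            if entry["verdict"] == wanted:
                return name
    return None


def hill_climb(initial: State,
               evaluate: Callable[[State, int], dict],
               edit_toward: Callable[[str, str], str],
               target: float = 0.9, patience: int = 4,
               max_iters: int = 25) -> Tuple[str, State]:
    best, best_state, stale = -1.0, dict(initial), 0
    state = dict(initial)
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)  # caller renders the prompt and runs the bridge
        if report["mismatch_count"] == 0 and report["overall_score"] >= target:
            return "converged", state
        if report["overall_score"] > best:
            best, best_state, stale = report["overall_score"], dict(state), 0
        else:
            stale += 1  # the change did not help; best_state is kept
        if stale >= patience:
            return "plateau", best_state
        axis = pick_axis(report["axes"])
        if axis is None:
            return "plateau", best_state
        state = dict(best_state)  # always edit from the best state
        state[axis] = edit_toward(axis, report["axes"][axis]["ref"])  # change ONE axis
    return "budget", best_state


# Toy judge for demonstration only: an axis matches when its wording equals the reference.
REFERENCE = {"scene": "dim bedroom", "lighting_color": "warm low-key", "framing": "full body"}


def fake_evaluate(state: State, i: int) -> dict:
    axes = {k: {"verdict": "match" if state.get(k) == v else "mismatch",
                "ref": v, "gen": state.get(k, "")} for k, v in REFERENCE.items()}
    mism = sum(a["verdict"] == "mismatch" for a in axes.values())
    return {"overall_score": 1 - mism / len(axes), "mismatch_count": mism, "axes": axes}


status, final = hill_climb(
    {"scene": "bright kitchen", "lighting_color": "flat daylight", "framing": "close-up"},
    fake_evaluate, lambda axis, ref: ref)
print(status, final)
```

With the toy judge, one axis is fixed per iteration and the loop stops on the fourth evaluation. A real run would also need the retry-with-alternative-phrasing behaviour, which this sketch leaves to `edit_toward`.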
@@ -82,30 +89,30 @@ phrase to "doggy style"). No machine-supplied fix list — the agent owns this s

 ### Rules that matter

-1. **Change one axis per iteration** — clean attribution of the score delta. Batch two
-   only when both are very low and clearly independent.
+1. **Change one axis per iteration** — clean attribution of the delta. Batch two only when
+   both are `mismatch` and clearly independent.
 2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
   noise. Vary the seed only after converging, to confirm robustness.
 3. **Always edit from `best_state`**, never from a worse last state.
-4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the score
-   after a try, try an alternative phrasing for that axis before moving on.
-5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
-   orderings; if two candidates are within ~0.03, re-run each on a second seed.
-6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
+4. **Prioritize `mismatch` axes, then `partial`.** Steer toward `ref`; if the obvious
+   wording doesn't flip the verdict, try an alternative phrasing before moving on.
+5. **Trust the verdict + the ref/gen text, not fine score deltas.** The overall score is a
+   coarse mean; use `mismatch_count` falling as the real progress signal.
+6. **Log every step**: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.
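
For rule 6, a per-iteration log record might look like the sketch below. The JSON-lines format and the `log_step` helper are assumptions chosen for illustration; the field names simply mirror the tuple in the rule.

```python
import json


def log_step(i: int, axis: str, old: str, new: str, report: dict) -> str:
    """Build one JSON line describing a single iteration."""
    record = {
        "iter": i,
        "axis_changed": axis,
        "old": old,
        "new": new,
        "overall_score": report["overall_score"],
        "mismatch_count": report["mismatch_count"],
    }
    return json.dumps(record, ensure_ascii=False)


line = log_step(2, "scene", "bright kitchen", "dimly lit bedroom",
                {"overall_score": 0.63, "mismatch_count": 5})
print(line)
```

Appending one such line per iteration makes it easy to check afterwards that `mismatch_count` actually fell over the run.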

 ## Worked example

 ```
-iter1  overall=0.41   worst: scene 0.30  ref:[dim bedroom]   gen:[bright kitchen]
+iter1  overall=0.55  mism=6   worst: scene MISMATCH  ref:[dim bedroom] gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
-iter2  overall=0.58   worst: position 0.35  ref:[doggy style] gen:[missionary]
-       edit position → "doggy style"
-iter3  overall=0.71   worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
-       edit lighting → "warm low-key lighting"   (0.69 → revert)
-iter4  overall=0.69   retry lighting → "warm golden low-key glow"   (0.84 → keep)
-iter5  overall=0.84   worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
-       edit clothing → "red lace lingerie"
-iter6  overall=0.89   ≥ target → STOP
+iter2  overall=0.63  mism=5   worst: position_name MISMATCH ref:[doggy style] gen:[cowgirl]
+       edit position → "doggy style, from behind"
+iter3  overall=0.71  mism=3   worst: lighting_color MISMATCH ref:[warm low-key] gen:[flat daylight]
+       edit lighting → "warm low-key lighting"   (mism=4 → revert)
+iter4  retry lighting → "warm golden low-key glow"   (mism=2 → keep, overall=0.82)
+iter5  overall=0.88  mism=1   worst: hair MISMATCH ref:[curly shoulder-length] gen:[straight long]
+       edit hair → "curly shoulder-length brown hair"
+iter6  overall=0.93  mism=0   ≥ target → STOP
 ```

 ## Report shape the agent reads (`latest.json` / stdout)
@@ -113,10 +120,11 @@ iter6  overall=0.89   ≥ target → STOP
 ```json
 {
  "run_tag": "iter002",
-  "overall_score": 0.58,
+  "overall_score": 0.63,
+  "mismatch_count": 5,
  "axes": {
-    "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
-    "scene":    {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
+    "position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
+    "scene":         {"verdict": "match",    "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
 }
@@ -125,9 +133,10 @@ iter6  overall=0.89   ≥ target → STOP
 ## Agent system prompt (paste into your CLI agent)

 > You are the controller for a local image prompt calibrator. Goal: make a generated
-> image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
-> body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1
-> closeness), `ref` (what the reference shows) and `gen` (what the generated shows).
+> image match a reference, measured by a Qwen3-VL judge that compares ~24 axes (identity,
+> body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict`
+> (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the
+> generated shows). `overall_score` and `mismatch_count` are computed from the verdicts.
 >
 > You hold an **axis state** (current value per axis). Each turn: (1) render it to a
 > prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
@@ -135,8 +144,8 @@ iter6  overall=0.89   ≥ target → STOP
 > (3) read the printed JSON.
 >
 > Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
-> revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
-> wording to match its `ref` value (you decide the wording — there are no machine-supplied
-> fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at
-> `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
-> Log every step and report the best prompt + score.
+> revert to the best state; pick a **mismatch** axis (else a **partial** axis) and rewrite
+> that axis's prompt wording to match its `ref` value (you decide the wording — there are
+> no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching.
+> Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after
+> PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.
@@ -118,23 +118,26 @@ observes; it suggests no fixes (a stronger external model owns correction).

 ```json
 {
-  "overall_score": 0.0,
  "axes": {
-    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
-    "position":      {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
-    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
-    "scene":         {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
-    "framing":       {"score": 0.6, "ref": "full body", "gen": "close-up"},
-    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+    "subject_count":  {"verdict": "match",    "ref": "1 woman", "gen": "1 woman"},
+    "position_name":  {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
+    "clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
+    "scene":          {"verdict": "partial",  "ref": "dim bedroom", "gen": "lit bedroom"},
+    "lighting_color": {"verdict": "match",    "ref": "warm low-key", "gen": "warm low-key"}
  }
 }
 ```

-The axis list is **configurable** on the node. The default ~20 axes are grouped as
-identity / body / wardrobe / action / affect / camera / render, kept granular so the
-*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
-stays discriminative for explicit content. The agent steers each low axis's prompt wording
-toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
+A **discrete verdict** (match/partial/mismatch) is used instead of a 0–1 score: small VLMs
+give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
+buckets reliably. `overall_score` + `mismatch_count` are computed from the verdicts on our
+side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
+**configurable**; the default ~24 axes are grouped identity / body / wardrobe / action·pose
+/ affect / camera / render, with the action·pose cluster split fine (`sexual_act`,
+`position_name`, `body_orientation`, `limb_arrangement`, `penetration`, `contact_points`,
+`genital_visibility`) so it stays discriminative for explicit content. Each axis carries a
+one-line definition in the prompt. The agent steers each `mismatch`/`partial` axis toward
+its `ref`. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).

 ### Reducing VLM‑as‑judge variance (important)