Switch compare to discrete verdicts + granular pose axes + per-axis definitions
The 4B's 0-1 scores were unreliable (identical ref/gen scored ~0.6), so the judge now returns a verdict (match/partial/mismatch) per axis; overall_score and a new mismatch_count are computed from the verdicts on our side (reliable, monotonic).

Expanded the action/pose cluster into position_name, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility (+ breast_size) so explicit poses carry detail. Each axis now ships a one-line definition in the prompt so gender_mix/subject_count stop absorbing positional text. 24 axes total.

Example workflows use the node default (axes=''). Docs realigned; stop condition is now mismatch_count==0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -51,11 +51,11 @@ default skip download entirely.
|
|||||||
|
|
||||||
| name | type | use |
|
| name | type | use |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `overall_score` | FLOAT 0..1 | compare: loop stop-condition / objective. describe: `1.0` placeholder |
|
| `overall_score` | FLOAT 0..1 | compare: mean verdict (computed here, not by the model). describe: `1.0` placeholder |
|
||||||
| `axis_scores_json` | STRING (JSON) | compare: per-axis `{score, ref, gen}`. describe: per-axis target values `{axis: value}` |
|
| `axis_scores_json` | STRING (JSON) | compare: per-axis `{verdict, ref, gen}` (verdict = match/partial/mismatch). describe: `{axis: value}` |
|
||||||
| `analysis` | STRING | compare: summary, worst axes first (`score ref:[…] gen:[…]`). describe: the prompt-ready `caption` |
|
| `analysis` | STRING | compare: header (`overall, N mismatches`) + axes worst-first (`VERDICT ref:[…] gen:[…]`). describe: the `caption` |
|
||||||
| `raw` | STRING | raw model output (both passes if `swap_eval`) |
|
| `raw` | STRING | raw model output (both passes if `swap_eval`) |
|
||||||
| `report_path` | STRING | path to the written `calib_<tag>.json` |
|
| `report_path` | STRING | path to the written `calib_<tag>.json` (carries `mismatch_count`) |
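A minimal sketch of how a controller might consume the file behind `report_path`, assuming the report is a JSON object with the `overall_score`, `mismatch_count` and `axes` fields listed above. The helper names `load_report` and `is_converged`, the sample values and the temporary file are illustrative assumptions, not part of the node.

```python
import json
import os
import tempfile


def load_report(path):
    """Read one calibrator report (a JSON object) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def is_converged(report, target=0.9):
    """Stop test: no mismatching axes and overall_score at or above target.

    A report without mismatch_count is treated as not converged.
    """
    return (report.get("mismatch_count", 1) == 0
            and report.get("overall_score", 0.0) >= target)


# Illustrative report: one mismatch + one partial -> mean ordinal 0.25.
sample = {
    "run_tag": "iter003",
    "overall_score": 0.25,
    "mismatch_count": 1,
    "axes": {
        "scene": {"verdict": "mismatch", "ref": "dim bedroom", "gen": "bright kitchen"},
        "hair": {"verdict": "partial", "ref": "curly shoulder-length", "gen": "straight long"},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    report_path = os.path.join(tmp, "calib_iter003.json")  # hypothetical location
    with open(report_path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    report = load_report(report_path)

print(is_converged(report))
```

Defaulting a missing `mismatch_count` to a non-zero value keeps an older or malformed report from being read as a successful stop.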
|
||||||
|
|
||||||
## Install
|
## Install
|
||||||
|
|
||||||
|
|||||||
+3
-2
@@ -59,9 +59,10 @@ Stdout (captured by the agent) is the report:
|
|||||||
{
|
{
|
||||||
"run_tag": "iter003",
|
"run_tag": "iter003",
|
||||||
"overall_score": 0.62,
|
"overall_score": 0.62,
|
||||||
|
"mismatch_count": 1,
|
||||||
"axes": {
|
"axes": {
|
||||||
"position": {"score": 0.40, "ref": "doggy style", "gen": "missionary"},
|
"position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
|
||||||
"clothing_state": {"score": 0.85, "ref": "red lace lingerie", "gen": "plain bra"}
|
"clothing_state": {"verdict": "partial", "ref": "red lace lingerie", "gen": "plain bra"}
|
||||||
},
|
},
|
||||||
"prompt_used": "...",
|
"prompt_used": "...",
|
||||||
"_prompt_id": "…", "_report_path": "…/calib_iter003.json"
|
"_prompt_id": "…", "_report_path": "…/calib_iter003.json"
|
||||||
|
|||||||
+50
-41
@@ -14,11 +14,14 @@ the agent needs three things:
|
|||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
|
| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
|
||||||
| `gen` | what the **generated** image shows | the **current** state — what to change |
|
| `gen` | what the **generated** image shows | the **current** state — what to change |
|
||||||
| `score` | 0–1 closeness | the **gap / priority** — which axes to fix first |
|
| `verdict` | `match` / `partial` / `mismatch` | which axes to fix first (mismatch → partial → match) |
|
||||||
|
|
||||||
That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
|
That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
|
||||||
prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
|
prompt so `gen → ref` on the **mismatch** (then `partial`) axes. The judge returns
|
||||||
(`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
|
`{"verdict", "ref", "gen"}` per axis. A discrete verdict is used because small VLMs give
|
||||||
|
**unreliable 0–1 scores** (identical ref/gen often scored ~0.6) but classify match/partial/
|
||||||
|
mismatch reliably. `overall_score` and `mismatch_count` are computed **from the verdicts on
|
||||||
|
our side** (mean ordinal), so they're monotonic and trustworthy as a stop signal.
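The "mean ordinal" computation can be sketched as follows. It mirrors the mapping this commit uses (match = 1.0, partial = 0.5, mismatch = 0.0, anything unrecognised counted as mismatch); the standalone function name and the sample axes are illustrative only.

```python
VERDICT_ORDINAL = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def score_from_axes(axes):
    """Return (mean ordinal, mismatch count) for an {axis: {"verdict": ...}} map."""
    if not axes:
        return 0.0, 0
    ordinals = [
        # Unknown or missing verdicts fall back to 0.0, i.e. count as mismatch.
        VERDICT_ORDINAL.get(str(info.get("verdict")).strip().lower(), 0.0)
        for info in axes.values()
    ]
    mismatches = sum(1 for o in ordinals if o == 0.0)
    return round(sum(ordinals) / len(ordinals), 4), mismatches


axes = {
    "scene": {"verdict": "match"},
    "hair": {"verdict": "partial"},
    "lighting_color": {"verdict": "mismatch"},
    "framing": {"verdict": "match"},
}
overall, mismatch_count = score_from_axes(axes)
print(overall, mismatch_count)  # 0.625 1
```

Because each axis contributes a fixed step, improving one verdict can only raise the mean, which is what makes it usable as a stop signal.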
|
||||||
|
|
||||||
The axes must **span what the prompt can express** — you can only fix what the prompt can
|
The axes must **span what the prompt can express** — you can only fix what the prompt can
|
||||||
say, and each diff must map to a lever. The default set (configurable on the node) is
|
say, and each diff must map to a lever. The default set (configurable on the node) is
|
||||||
@@ -27,16 +30,19 @@ grouped below.
|
|||||||
## Axes (default set — edit `axes` on the node to taste)
|
## Axes (default set — edit `axes` on the node to taste)
|
||||||
|
|
||||||
- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
|
- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
|
||||||
- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
|
- **Body:** `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
|
||||||
- **Wardrobe:** `clothing_state` (degree of undress + garments)
|
- **Wardrobe:** `clothing_state` (degree of undress + garments)
|
||||||
- **Action (where explicit content concentrates):** `sexual_act`, `position`,
|
- **Action / pose (where explicit content concentrates — kept granular):** `sexual_act`,
|
||||||
`penetration`, `explicitness`, `body_contact`
|
`position_name` (doggy/cowgirl/…), `body_orientation` (on top/from behind/…),
|
||||||
- **Affect:** `pose`, `facial_expression`, `gaze`
|
`limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle),
|
||||||
|
`contact_points`, `genital_visibility`, `pose` (torso/head lean)
|
||||||
|
- **Affect:** `facial_expression`, `gaze`
|
||||||
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
|
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
|
||||||
- **Render:** `scene`, `lighting_color`, `art_style`
|
- **Render:** `scene`, `lighting_color`, `art_style`
|
||||||
|
|
||||||
Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
|
Each axis carries a one-line definition in the prompt (so e.g. `gender_mix` is a *count*,
|
||||||
interaction cluster granular so the agent gets actionable targets.
|
not a position). Coarse axes blur the differences that matter for adult imagery; the act /
|
||||||
|
pose cluster is split into many axes so the agent gets specific, actionable targets.
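A small sketch of how the per-axis definitions could be rendered into the prompt, using three entries copied from the default map in this commit. The `mood` axis is a hypothetical custom axis, included to show the by-name fallback for axes that have no definition.

```python
# Subset of the default definitions, shortened for the example.
AXIS_DEFS = {
    "subject_count": "how many people are present (a count)",
    "gender_mix": "composition BY GENDER as a count, e.g. '1 female, 1 male' (NOT positions)",
    "scene": "location, furniture, props, background",
}


def axis_definition_block(axes):
    """One ' - axis: definition' line per axis; unknown axes are shown 'as named'."""
    return "\n".join(f"  - {a}: {AXIS_DEFS.get(a, 'as named')}" for a in axes)


block = axis_definition_block(["subject_count", "scene", "mood"])
print(block)
```

Keeping the fallback means a custom axis still reaches the judge, just without the disambiguating text.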
|
||||||
|
|
||||||
## Step 0 — first pass (describe / bootstrap)
|
## Step 0 — first pass (describe / bootstrap)
|
||||||
|
|
||||||
@@ -57,21 +63,22 @@ written by hand — the VL provides the target to reproduce.
|
|||||||
## Per-iteration algorithm (greedy per-axis hill-climb)
|
## Per-iteration algorithm (greedy per-axis hill-climb)
|
||||||
|
|
||||||
```
|
```
|
||||||
best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
|
best = -1 ; best_state = initial_state ; stale = 0 ; i = 0
|
||||||
loop:
|
loop:
|
||||||
i += 1
|
i += 1
|
||||||
prompt = render(state) # state = current value per axis
|
prompt = render(state) # state = current value per axis
|
||||||
report = run agent_bridge.py --prompt prompt --negative state.negative
|
report = run agent_bridge.py --prompt prompt --negative state.negative
|
||||||
--seed state.seed --run-tag iter{i}
|
--seed state.seed --run-tag iter{i}
|
||||||
--workflow wf.json --analysis-dir <report_dir>
|
--workflow wf.json --analysis-dir <report_dir>
|
||||||
if report.overall_score >= TARGET: stop("converged", state) # e.g. 0.85
|
if report.mismatch_count == 0 and report.overall_score >= TARGET:
|
||||||
if report.overall_score > best_score:
|
stop("converged", state) # TARGET e.g. 0.9 (mostly match)
|
||||||
best_score = report.overall_score ; best_state = state ; stale = 0
|
if report.overall_score > best:
|
||||||
|
best = report.overall_score ; best_state = state ; stale = 0
|
||||||
else:
|
else:
|
||||||
stale += 1 ; state = best_state # revert the change that didn't help
|
stale += 1 ; state = best_state # revert the change that didn't help
|
||||||
if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
|
if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
|
||||||
|
|
||||||
worst = axis with the lowest report.axes[*].score
|
worst = a `mismatch` axis (else a `partial` axis) from report.axes
|
||||||
target_value = report.axes[worst].ref # what the reference shows
|
target_value = report.axes[worst].ref # what the reference shows
|
||||||
state = apply(best_state, worst, edit_toward(target_value)) # change ONE axis
|
state = apply(best_state, worst, edit_toward(target_value)) # change ONE axis
|
||||||
```
|
```
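The loop above can be sketched in Python as follows. This is a simplified illustration, not the agent's implementation: `evaluate` stands in for running `agent_bridge.py` and reading the report, the edit step just copies `ref` into the state (a real agent chooses the wording), and `toy_evaluate`, `REFERENCE` and `pick_worst` are hypothetical names.

```python
def pick_worst(axes):
    """First mismatch axis, else first partial axis, else None."""
    for wanted in ("mismatch", "partial"):
        for name, info in axes.items():
            if info.get("verdict") == wanted:
                return name
    return None


def hill_climb(evaluate, initial_state, target=0.9, patience=4, max_iters=25):
    best_score, best_state = -1.0, dict(initial_state)
    state, stale = dict(initial_state), 0
    for i in range(1, max_iters + 1):
        report = evaluate(state)
        if report["mismatch_count"] == 0 and report["overall_score"] >= target:
            return "converged", state, i
        if report["overall_score"] > best_score:
            best_score, best_state, stale = report["overall_score"], dict(state), 0
        else:
            stale += 1  # the last change did not help; it is dropped below
        if stale >= patience:
            return "plateau", best_state, i
        worst = pick_worst(report["axes"])
        if worst is None:
            return "plateau", best_state, i
        state = dict(best_state)  # always edit from the best state
        state[worst] = report["axes"][worst]["ref"]  # change ONE axis toward ref
    return "budget", best_state, max_iters


# Toy judge: an axis matches only when the state equals the reference value.
REFERENCE = {"scene": "dim bedroom", "hair": "curly shoulder-length",
             "lighting_color": "warm low-key"}


def toy_evaluate(state):
    axes = {}
    for name, ref in REFERENCE.items():
        gen = state.get(name, "")
        verdict = "match" if gen == ref else "mismatch"
        axes[name] = {"verdict": verdict, "ref": ref, "gen": gen}
    ordinals = [1.0 if a["verdict"] == "match" else 0.0 for a in axes.values()]
    return {"overall_score": sum(ordinals) / len(ordinals),
            "mismatch_count": ordinals.count(0.0), "axes": axes}


start = {"scene": "bright kitchen", "hair": "straight long",
         "lighting_color": "flat daylight"}
status, final_state, iters = hill_climb(toy_evaluate, start)
print(status, iters)
```

With the toy judge every edit succeeds, so the run converges after one evaluation per wrong axis plus a final confirming one; a real judge will sometimes trigger the revert path instead.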
|
||||||
@@ -82,30 +89,30 @@ phrase to "doggy style"). No machine-supplied fix list — the agent owns this s
|
|||||||
|
|
||||||
### Rules that matter
|
### Rules that matter
|
||||||
|
|
||||||
1. **Change one axis per iteration** — clean attribution of the score delta. Batch two
|
1. **Change one axis per iteration** — clean attribution of the delta. Batch two only when
|
||||||
only when both are very low and clearly independent.
|
both are `mismatch` and clearly independent.
|
||||||
2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
|
2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
|
||||||
noise. Vary the seed only after converging, to confirm robustness.
|
noise. Vary the seed only after converging, to confirm robustness.
|
||||||
3. **Always edit from `best_state`**, never from a worse last state.
|
3. **Always edit from `best_state`**, never from a worse last state.
|
||||||
4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the score
|
4. **Prioritize `mismatch` axes, then `partial`.** Steer toward `ref`; if the obvious
|
||||||
after a try, try an alternative phrasing for that axis before moving on.
|
wording doesn't flip the verdict, try an alternative phrasing before moving on.
|
||||||
5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
|
5. **Trust the verdict + the ref/gen text, not fine score deltas.** The overall score is a
|
||||||
orderings; if two candidates are within ~0.03, re-run each on a second seed.
|
coarse mean; use `mismatch_count` falling as the real progress signal.
|
||||||
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
|
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.
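Rules 4 and 6 can be sketched as two small helpers: one orders axes mismatch, then partial, then match, and one formats the per-step log record. Both names and the exact log layout are illustrative assumptions; unrecognised verdicts are ranked with the mismatches, in line with how the scoring treats them.

```python
VERDICT_RANK = {"mismatch": 0, "partial": 1, "match": 2}


def worst_first(axes):
    """Axis names ordered mismatch, partial, match (input order kept within a group)."""
    return sorted(axes, key=lambda name: VERDICT_RANK.get(axes[name].get("verdict"), 0))


def log_line(iteration, axis, old, new, overall_score, mismatch_count):
    """One log record: (iter, axis_changed, old -> new, overall_score, mismatch_count)."""
    return (f"iter{iteration} {axis}: {old!r} -> {new!r} "
            f"overall={overall_score:.2f} mism={mismatch_count}")


axes = {
    "scene": {"verdict": "match"},
    "hair": {"verdict": "partial"},
    "lighting_color": {"verdict": "mismatch"},
}
order = worst_first(axes)
line = log_line(2, "scene", "bright kitchen", "dimly lit bedroom", 0.63, 5)
print(order)
print(line)
```

`sorted` is stable, so axes with the same verdict keep the order in which the report listed them.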
|
||||||
|
|
||||||
## Worked example
|
## Worked example
|
||||||
|
|
||||||
```
|
```
|
||||||
iter1 overall=0.41 worst: scene 0.30 ref:[dim bedroom] gen:[bright kitchen]
|
iter1 overall=0.55 mism=6 worst: scene MISMATCH ref:[dim bedroom] gen:[bright kitchen]
|
||||||
edit scene → "dimly lit bedroom"
|
edit scene → "dimly lit bedroom"
|
||||||
iter2 overall=0.58 worst: position 0.35 ref:[doggy style] gen:[missionary]
|
iter2 overall=0.63 mism=5 worst: position_name MISMATCH ref:[doggy style] gen:[cowgirl]
|
||||||
edit position → "doggy style"
|
edit position → "doggy style, from behind"
|
||||||
iter3 overall=0.71 worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
|
iter3 overall=0.71 mism=3 worst: lighting_color MISMATCH ref:[warm low-key] gen:[flat daylight]
|
||||||
edit lighting → "warm low-key lighting" (0.69 → revert)
|
edit lighting → "warm low-key lighting" (mism=4 → revert)
|
||||||
iter4 overall=0.69 retry lighting → "warm golden low-key glow" (0.84 → keep)
|
iter4 retry lighting → "warm golden low-key glow" (mism=2 → keep, overall=0.82)
|
||||||
iter5 overall=0.84 worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
|
iter5 overall=0.88 mism=1 worst: hair PARTIAL ref:[curly shoulder-length] gen:[straight long]
|
||||||
edit clothing → "red lace lingerie"
|
edit hair → "curly shoulder-length brown hair"
|
||||||
iter6 overall=0.89 ≥ target → STOP
|
iter6 overall=0.93 mism=0 ≥ target → STOP
|
||||||
```
|
```
|
||||||
|
|
||||||
## Report shape the agent reads (`latest.json` / stdout)
|
## Report shape the agent reads (`latest.json` / stdout)
|
||||||
@@ -113,10 +120,11 @@ iter6 overall=0.89 ≥ target → STOP
|
|||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"run_tag": "iter002",
|
"run_tag": "iter002",
|
||||||
"overall_score": 0.58,
|
"overall_score": 0.63,
|
||||||
|
"mismatch_count": 5,
|
||||||
"axes": {
|
"axes": {
|
||||||
"position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
|
"position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
|
||||||
"scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
|
"scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"}
|
||||||
},
|
},
|
||||||
"prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
|
"prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
|
||||||
}
|
}
|
||||||
@@ -125,9 +133,10 @@ iter6 overall=0.89 ≥ target → STOP
|
|||||||
## Agent system prompt (paste into your CLI agent)
|
## Agent system prompt (paste into your CLI agent)
|
||||||
|
|
||||||
> You are the controller for a local image prompt calibrator. Goal: make a generated
|
> You are the controller for a local image prompt calibrator. Goal: make a generated
|
||||||
> image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
|
> image match a reference, measured by a Qwen3-VL judge that compares ~24 axes (identity,
|
||||||
> body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1
|
> body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict`
|
||||||
> closeness), `ref` (what the reference shows) and `gen` (what the generated shows).
|
> (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the
|
||||||
|
> generated shows). `overall_score` and `mismatch_count` are computed from the verdicts.
|
||||||
>
|
>
|
||||||
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
|
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
|
||||||
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
|
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
|
||||||
@@ -135,8 +144,8 @@ iter6 overall=0.89 ≥ target → STOP
|
|||||||
> (3) read the printed JSON.
|
> (3) read the printed JSON.
|
||||||
>
|
>
|
||||||
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
|
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
|
||||||
> revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
|
> revert to the best state; pick a **mismatch** axis (else a **partial** axis) and rewrite
|
||||||
> wording to match its `ref` value (you decide the wording — there are no machine-supplied
|
> that axis's prompt wording to match its `ref` value (you decide the wording — there are
|
||||||
> fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at
|
> no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching.
|
||||||
> `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
|
> Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after
|
||||||
> Log every step and report the best prompt + score.
|
> PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.
|
||||||
|
|||||||
+15
-12
@@ -118,23 +118,26 @@ observes; it suggests no fixes (a stronger external model owns correction).
|
|||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"overall_score": 0.0,
|
|
||||||
"axes": {
|
"axes": {
|
||||||
"subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
|
"subject_count": {"verdict": "match", "ref": "1 woman", "gen": "1 woman"},
|
||||||
"position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
|
"position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
|
||||||
"clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
|
"clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
|
||||||
"scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
|
"scene": {"verdict": "partial", "ref": "dim bedroom", "gen": "lit bedroom"},
|
||||||
"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
|
"lighting_color": {"verdict": "match", "ref": "warm low-key", "gen": "warm low-key"}
|
||||||
"lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
The axis list is **configurable** on the node. The default ~20 axes are grouped as
|
A **discrete verdict** (match/partial/mismatch) is used instead of a 0–1 score: small VLMs
|
||||||
identity / body / wardrobe / action / affect / camera / render, kept granular so the
|
give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
|
||||||
*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
|
buckets reliably. `overall_score` + `mismatch_count` are computed from the verdicts on our
|
||||||
stays discriminative for explicit content. The agent steers each low axis's prompt wording
|
side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
|
||||||
toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
|
**configurable**; the default ~24 axes are grouped identity / body / wardrobe / action·pose
|
||||||
|
/ affect / camera / render, with the action·pose cluster split fine (`sexual_act`,
|
||||||
|
`position_name`, `body_orientation`, `limb_arrangement`, `penetration`, `contact_points`,
|
||||||
|
`genital_visibility`) so it stays discriminative for explicit content. Each axis carries a
|
||||||
|
one-line definition in the prompt. The agent steers each `mismatch`/`partial` axis toward
|
||||||
|
its `ref`. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
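For the configurable axis list, here is a sketch of one way a comma-separated `axes` field could be resolved, with an empty string falling back to the default set. The node's actual parsing is not shown in this commit, so `resolve_axes`, the de-duplication step and the shortened `DEFAULT_AXES` are assumptions for illustration.

```python
# Shortened stand-in for the node default (the real default lists all axes).
DEFAULT_AXES = "subject_count, gender_mix, scene, lighting_color, art_style"


def split_names(text):
    """Split a comma-separated string into stripped, non-empty names."""
    return [part.strip() for part in (text or "").split(",") if part.strip()]


def resolve_axes(axes_field, default=DEFAULT_AXES):
    """Use the configured axes, or the default set when the field is empty."""
    names = split_names(axes_field) or split_names(default)
    # Assumption: drop duplicates, keeping the first occurrence of each name.
    return list(dict.fromkeys(names))


print(resolve_axes(""))
print(resolve_axes("scene, hair ,scene"))
```

Leaving `axes` empty in a workflow means it picks up whatever the default becomes, which is how the example workflows in this commit are configured.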
|
||||||
|
|
||||||
### Reducing VLM‑as‑judge variance (important)
|
### Reducing VLM‑as‑judge variance (important)
|
||||||
|
|
||||||
|
|||||||
+91
-56
@@ -41,36 +41,44 @@ RECOMMENDED_MODELS = {
|
|||||||
"4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated",
|
"4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated",
|
||||||
}
|
}
|
||||||
|
|
||||||
# Difference axes the judge scores. Granular by default so the comparison is
|
# Difference axes + a one-line definition each. Definitions are injected into the
|
||||||
# discriminative for explicit/adult imagery (where coarse axes blur the differences
|
# prompt so the model fills the right axis (e.g. gender_mix = a count, not a position)
|
||||||
# that matter). Fully configurable on the node — trim or extend per use case.
|
# and the action/pose cluster is captured in detail. Fully configurable on the node;
|
||||||
# subject_count number of people
|
# any axis not in this map is still allowed (shown to the model by name only).
|
||||||
# gender_mix gender composition (e.g. 1F, 2F1M)
|
AXIS_DEFS = {
|
||||||
# body_type physique / build / proportions per subject
|
# identity / cast
|
||||||
# distinctive_features tattoos / piercings / marks (identity anchors)
|
"subject_count": "how many people are present (a count)",
|
||||||
# age_appearance apparent age
|
"gender_mix": "composition BY GENDER as a count, e.g. '1 female, 1 male' (NOT positions)",
|
||||||
# ethnicity_skin ethnicity / skin tone
|
"age_appearance": "apparent age range of each subject",
|
||||||
# hair length, color, style
|
"ethnicity_skin": "ethnicity and skin tone",
|
||||||
# clothing_state degree of undress + specific garments
|
# body
|
||||||
# sexual_act the act / activity being performed
|
"body_type": "overall physique / build (slim, curvy, athletic, BBW...)",
|
||||||
# position sexual position / arrangement of bodies
|
"breast_size": "breast size and shape of female subject(s)",
|
||||||
# penetration type & visibility of penetration
|
"distinctive_features": "tattoos, piercings, nail polish, scars — identity anchors",
|
||||||
# explicitness how graphic / genital visibility level
|
"hair": "hair length, color, texture, and style",
|
||||||
# body_contact who contacts whom; interaction between subjects
|
# wardrobe
|
||||||
# pose non-act body positioning
|
"clothing_state": "degree of undress and any garments / lingerie / accessories",
|
||||||
# facial_expression face / affect
|
# action & pose cluster (the crux for explicit content — be specific)
|
||||||
# gaze eye contact / look direction
|
"sexual_act": "type of activity: vaginal, anal, oral/blowjob, handjob, fingering, none...",
|
||||||
# framing shot type / crop (close-up <-> full body)
|
"position_name": "the named sex position if identifiable (doggy, missionary, cowgirl/reverse, spooning, 69...)",
|
||||||
# camera_angle POV / angle / perspective
|
"body_orientation": "how bodies are oriented: who is on top/bottom/side, facing each other or from behind",
|
||||||
# scene location / setting / background
|
"limb_arrangement": "placement of legs and arms (spread, bent, raised, over shoulder, kneeling) and hand placement",
|
||||||
# lighting_color palette, lighting, color grade
|
"penetration": "penetration type, depth (shallow/full), angle, and how visible it is",
|
||||||
# art_style photoreal vs anime/illustrated, render style
|
"contact_points": "where bodies touch: grip/hands location, mouth, points of contact",
|
||||||
DEFAULT_AXES = (
|
"genital_visibility": "which genitals are visible and how explicitly the frame shows them",
|
||||||
"subject_count, gender_mix, body_type, distinctive_features, age_appearance, "
|
"pose": "overall body posture not covered above (torso/head lean, arch, twist)",
|
||||||
"ethnicity_skin, hair, clothing_state, sexual_act, position, penetration, "
|
# affect
|
||||||
"explicitness, body_contact, pose, facial_expression, gaze, framing, "
|
"facial_expression": "facial expression / affect (eyes, mouth, brow)",
|
||||||
"camera_angle, scene, lighting_color, art_style"
|
"gaze": "gaze direction / eye contact (at camera, partner, away, eyes closed)",
|
||||||
)
|
# camera
|
||||||
|
"framing": "shot type and crop (close-up, medium, full body) and what the frame centers on",
|
||||||
|
"camera_angle": "camera angle / POV (low, high, eye-level, POV/first-person)",
|
||||||
|
# render
|
||||||
|
"scene": "location, furniture, props, background",
|
||||||
|
"lighting_color": "lighting quality and color palette / grade",
|
||||||
|
"art_style": "rendering style and realism (photoreal, anime, illustration, 3D)",
|
||||||
|
}
|
||||||
|
DEFAULT_AXES = ", ".join(AXIS_DEFS)
|
||||||
|
|
||||||
# Cache loaded (model, processor) keyed by (path, precision) so the loop does not
|
# Cache loaded (model, processor) keyed by (path, precision) so the loop does not
|
||||||
# reload weights every iteration.
|
# reload weights every iteration.
|
||||||
@@ -224,32 +232,35 @@ def _ensure_chat_template(processor, model_path: str):
|
|||||||
processor.chat_template = tok.chat_template
|
processor.chat_template = tok.chat_template
|
||||||
|
|
||||||
|
|
||||||
|
def _axis_definition_block(axes: list[str]) -> str:
|
||||||
|
return "\n".join(f" - {a}: {AXIS_DEFS.get(a, 'as named')}" for a in axes)
|
||||||
|
|
||||||
|
|
||||||
def _build_system_prompt(axes: list[str]) -> str:
|
def _build_system_prompt(axes: list[str]) -> str:
|
||||||
axis_lines = "\n".join(
|
axis_lines = "\n".join(
|
||||||
f' "{a}": {{"score": <0..1>, "ref": "<what IMAGE 1 shows>", "gen": "<what IMAGE 2 shows>"}},'
|
f' "{a}": {{"verdict": "match|partial|mismatch", "ref": "<IMAGE 1>", "gen": "<IMAGE 2>"}},'
|
||||||
for a in axes)
|
for a in axes)
|
||||||
return (
|
return (
|
||||||
"You are a meticulous visual-similarity judge for an image-generation "
|
"You are a meticulous visual-similarity judge for an image-generation "
|
||||||
"calibration loop. You are shown two images: IMAGE 1 is the REFERENCE "
|
"calibration loop. You are shown two images: IMAGE 1 is the REFERENCE "
|
||||||
"(the target) and IMAGE 2 is the GENERATED candidate. Judge how closely "
|
"(the target) and IMAGE 2 is the GENERATED candidate.\n\n"
|
||||||
"the GENERATED image reproduces the REFERENCE.\n\n"
|
|
||||||
"For every axis report THREE things:\n"
|
"For every axis report THREE things:\n"
|
||||||
" - ref: concretely what IMAGE 1 (reference / target) shows for this axis\n"
|
" - ref: concretely what IMAGE 1 (reference) shows for this axis\n"
|
||||||
" - gen: concretely what IMAGE 2 (generated) shows for this axis\n"
|
" - gen: concretely what IMAGE 2 (generated) shows for this axis\n"
|
||||||
" - score: 0..1 closeness, where 0.0 = unrelated, 0.5 = same general "
|
" - verdict: 'match' if ref and gen are essentially the same; 'partial' if "
|
||||||
"category but clearly different details, 1.0 = near-identical.\n"
|
"the same general idea but with a clear difference; 'mismatch' if clearly "
|
||||||
"Use specific concrete values (e.g. ref 'doggy style', gen 'missionary'), "
|
"different. If ref and gen describe the same thing, verdict MUST be 'match'.\n"
|
||||||
"not vague notes. Describe ONLY what you observe — do NOT suggest fixes or "
|
"Use specific concrete values (e.g. ref 'doggy style', gen 'cowgirl'), not "
|
||||||
"prompt changes; correction is handled by a separate model.\n\n"
|
"vague notes. Describe ONLY what you observe — do NOT suggest fixes.\n\n"
|
||||||
|
"Axes and exactly what each one means:\n"
|
||||||
|
f"{_axis_definition_block(axes)}\n\n"
|
||||||
"Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n"
|
"Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n"
|
||||||
"{\n"
|
"{\n"
|
||||||
' "overall_score": <0..1>,\n'
|
|
||||||
' "axes": {\n'
|
' "axes": {\n'
|
||||||
f"{axis_lines}\n"
|
f"{axis_lines}\n"
|
||||||
" }\n"
|
" }\n"
|
||||||
"}\n"
|
"}\n"
|
||||||
"overall_score must be consistent with the per-axis scores. If an axis is "
|
"If an axis does not apply to either image, verdict 'match' and ref/gen 'n/a'."
|
||||||
"not applicable to either image, set score 1.0 and ref/gen to \"n/a\"."
|
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@@ -378,6 +389,27 @@ def _parse_json(raw: str) -> dict | None:
|
|||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
_VERDICT_ORDINAL = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}
|
||||||
|
|
||||||
|
|
||||||
|
def _verdict_ordinal(verdict) -> float:
|
||||||
|
return _VERDICT_ORDINAL.get(str(verdict).strip().lower(), 0.0)
|
||||||
|
|
||||||
|
|
||||||
|
def _ordinal_verdict(x: float) -> str:
|
||||||
|
return "match" if x >= 0.75 else ("partial" if x >= 0.25 else "mismatch")
|
||||||
|
|
||||||
|
|
||||||
|
def _score_from_axes(axes: dict) -> tuple[float, int]:
|
||||||
|
"""Deterministic overall score (mean verdict ordinal) + mismatch count.
|
||||||
|
Computed here, not by the model, so it's reliable and monotonic."""
|
||||||
|
if not axes:
|
||||||
|
return 0.0, 0
|
||||||
|
ordinals = [_verdict_ordinal(v.get("verdict")) for v in axes.values()]
|
||||||
|
mismatches = sum(1 for o in ordinals if o == 0.0)
|
||||||
|
return round(sum(ordinals) / len(ordinals), 4), mismatches
|
||||||
|
|
||||||
|
|
||||||
def _merge_swapped(a: dict, b: dict) -> dict:
|
def _merge_swapped(a: dict, b: dict) -> dict:
|
||||||
"""Average two judgements (normal + order-swapped) to cut position bias."""
|
"""Average two judgements (normal + order-swapped) to cut position bias."""
|
||||||
if not b:
|
if not b:
|
||||||
@@ -385,19 +417,17 @@ def _merge_swapped(a: dict, b: dict) -> dict:
|
|||||||
if not a:
|
if not a:
|
||||||
return b
|
return b
|
||||||
out = {"axes": {}}
|
out = {"axes": {}}
|
||||||
out["overall_score"] = round(
|
|
||||||
(float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4
|
|
||||||
)
|
|
||||||
axes = set(a.get("axes", {})) | set(b.get("axes", {}))
|
axes = set(a.get("axes", {})) | set(b.get("axes", {}))
|
||||||
for ax in axes:
|
for ax in axes:
|
||||||
sa = a.get("axes", {}).get(ax, {})
|
sa = a.get("axes", {}).get(ax, {})
|
||||||
sb = b.get("axes", {}).get(ax, {})
|
sb = b.get("axes", {}).get(ax, {})
|
||||||
score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0
|
# Average the two passes' verdicts on a 0/0.5/1 scale, then re-bucket.
|
||||||
|
ord_avg = (_verdict_ordinal(sa.get("verdict")) + _verdict_ordinal(sb.get("verdict"))) / 2.0
|
||||||
# In pass b the images were swapped, so b.ref describes the generated image
|
# In pass b the images were swapped, so b.ref describes the generated image
|
||||||
# and b.gen the reference -> invert b when falling back.
|
# and b.gen the reference -> invert b when falling back.
|
||||||
ref = sa.get("ref") or sb.get("gen") or ""
|
ref = sa.get("ref") or sb.get("gen") or ""
|
||||||
gen = sa.get("gen") or sb.get("ref") or ""
|
gen = sa.get("gen") or sb.get("ref") or ""
|
||||||
out["axes"][ax] = {"score": round(score, 4), "ref": ref, "gen": gen}
|
out["axes"][ax] = {"verdict": _ordinal_verdict(ord_avg), "ref": ref, "gen": gen}
|
||||||
return out
|
return out
|
||||||
|
|
||||||
|
|
||||||
@@ -411,7 +441,8 @@ def _report_base_dir(report_dir: str) -> str:
|
|||||||
return os.path.join(os.path.dirname(os.path.dirname(__file__)), "output", "calibrator")
|
return os.path.join(os.path.dirname(os.path.dirname(__file__)), "output", "calibrator")
|
||||||
|
|
||||||
|
|
||||||
def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used):
|
def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used,
|
||||||
|
mismatch_count=0):
|
||||||
"""Persist the analysis so the external CLI agent can read it after a queue.
|
"""Persist the analysis so the external CLI agent can read it after a queue.
|
||||||
|
|
||||||
Writes a per-run file plus a stable `latest.json` the agent can always poll.
|
Writes a per-run file plus a stable `latest.json` the agent can always poll.
|
||||||
@@ -426,6 +457,7 @@ def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all,
|
|||||||
payload = {
|
payload = {
|
||||||
"run_tag": run_tag,
|
"run_tag": run_tag,
|
||||||
"overall_score": round(float(overall), 4),
|
"overall_score": round(float(overall), 4),
|
||||||
|
"mismatch_count": mismatch_count,
|
||||||
"axes": (merged or {}).get("axes", {}),
|
"axes": (merged or {}).get("axes", {}),
|
||||||
"diff_analysis": diff_analysis,
|
"diff_analysis": diff_analysis,
|
||||||
"prompt_used": prompt_used,
|
"prompt_used": prompt_used,
|
||||||
@@ -558,20 +590,23 @@ class QwenVLImageJudge:
|
|||||||
del model
|
del model
|
||||||
torch.cuda.empty_cache()
|
torch.cuda.empty_cache()
|
||||||
|
|
||||||
overall = float(merged.get("overall_score", 0.0)) if merged else 0.0
|
axes_map = merged.get("axes", {}) if merged else {}
|
||||||
axis_scores = json.dumps(merged.get("axes", {}), ensure_ascii=False, indent=2) if merged else "{}"
|
# Score is computed from verdicts here (reliable), not taken from the model.
|
||||||
|
overall, mismatch_count = _score_from_axes(axes_map)
|
||||||
|
axis_scores = json.dumps(axes_map, ensure_ascii=False, indent=2) if axes_map else "{}"
|
||||||
|
|
||||||
# Human/controller-readable diff summary, worst axes first (biggest gap).
|
# Summary worst-first: mismatch, then partial, then match.
|
||||||
items = sorted((merged.get("axes", {}) if merged else {}).items(),
|
items = sorted(axes_map.items(), key=lambda kv: _verdict_ordinal(kv[1].get("verdict")))
|
||||||
key=lambda kv: float(kv[1].get("score", 0)))
|
|
||||||
diff_lines = [
|
diff_lines = [
|
||||||
f"- {ax}: {info.get('score', 0):.2f} ref:[{info.get('ref', '')}] gen:[{info.get('gen', '')}]"
|
f"- {ax}: {str(info.get('verdict', '?')).upper():8} "
|
||||||
|
f"ref:[{info.get('ref', '')}] gen:[{info.get('gen', '')}]"
|
||||||
for ax, info in items
|
for ax, info in items
|
||||||
]
|
]
|
||||||
diff_analysis = "\n".join(diff_lines) if diff_lines else "(no parseable judgement)"
|
header = f"overall {overall:.2f} | {mismatch_count} mismatch(es) of {len(axes_map)} axes"
|
||||||
|
diff_analysis = header + "\n" + "\n".join(diff_lines) if diff_lines else "(no parseable judgement)"
|
||||||
|
|
||||||
report_path = _write_report(
|
report_path = _write_report(
|
||||||
report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used)
|
report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used, mismatch_count)
|
||||||
|
|
||||||
return (round(overall, 4), axis_scores, diff_analysis, raw_all, report_path)
|
return (round(overall, 4), axis_scores, diff_analysis, raw_all, report_path)
|
||||||
|
|
||||||
|
|||||||
@@ -67,7 +67,7 @@
|
|||||||
"generated_image": ["8", 0],
|
"generated_image": ["8", 0],
|
||||||
"model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
|
"model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
|
||||||
"precision": "bf16",
|
"precision": "bf16",
|
||||||
"axes": "cast, clothing, pose, scene, composition, expression, color_light",
|
"axes": "",
|
||||||
"max_new_tokens": 512,
|
"max_new_tokens": 512,
|
||||||
"temperature": 0.0,
|
"temperature": 0.0,
|
||||||
"swap_eval": true,
|
"swap_eval": true,
|
||||||
|
|||||||
@@ -11,7 +11,7 @@
|
|||||||
"mode": "describe",
|
"mode": "describe",
|
||||||
"model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
|
"model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
|
||||||
"precision": "bf16",
|
"precision": "bf16",
|
||||||
"axes": "subject_count, gender_mix, body_type, distinctive_features, age_appearance, ethnicity_skin, hair, clothing_state, sexual_act, position, penetration, explicitness, body_contact, pose, facial_expression, gaze, framing, camera_angle, scene, lighting_color, art_style",
|
"axes": "",
|
||||||
"max_new_tokens": 1024,
|
"max_new_tokens": 1024,
|
||||||
"temperature": 0.0,
|
"temperature": 0.0,
|
||||||
"swap_eval": false,
|
"swap_eval": false,
|
||||||
|
|||||||