Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score): the target/current/distance an agent needs to calibrate.

Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/camera/render) so the action cluster stays discriminative for explicit content.

swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -34,7 +34,7 @@ can act on it.
 | `generated_image` | IMAGE | — | the candidate to score |
 | `model_path` | STRING | `/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16` | local dir, **HF repo id** (`org/name`), or alias (`30b-a3b` / `8b` / `4b`) |
 | `precision` | bf16 / fp16 / fp8 / nf4 | bf16 | `nf4` = 4-bit (run the 30B judge on 32 GB); `fp8` with the `hf_fp8` copy |
-| `axes` | STRING | cast, clothing, pose, scene, composition, expression, color_light | scored axes (match your Prompt-Builder knobs) |
+| `axes` | STRING | ~20 axes (identity, body, wardrobe, action, affect, camera, render) | scored axes; granular for explicit content. Edit to taste |
 | `max_new_tokens` | INT | 512 | |
 | `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
 | `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
@@ -51,8 +51,8 @@ default skip download entirely.
 | name | type | use |
 |---|---|---|
 | `overall_score` | FLOAT 0..1 | loop stop-condition / objective |
-| `axis_scores_json` | STRING (JSON) | per-axis `{score, diff}` for the controller |
-| `diff_analysis` | STRING | human/controller-readable summary + fix suggestions |
+| `axis_scores_json` | STRING (JSON) | per-axis `{score, ref, gen}` — target vs current, for the agent |
+| `diff_analysis` | STRING | readable summary, worst axes first (`score ref:[…] gen:[…]`) |
 | `raw` | STRING | raw model output (both passes if `swap_eval`) |
 
 ## Install
+1
-1
@@ -19,7 +19,7 @@ Stdlib only — no third-party deps, so any agent can shell out to it.
 Loop, from the agent's side:
 1. build a prompt (calibrate from the previous analysis)
 2. run this script -> capture stdout (the analysis JSON)
-3. read overall_score + per-axis diffs + fix_suggestions
+3. read overall_score + per-axis {score, ref, gen}
 4. adjust the prompt and go to 1, until overall_score >= target
 """
 
+11
-12
@@ -18,8 +18,8 @@ reads the analysis, calibrates the prompt generator, and queues the next iterati
 │ writes calib_<tag>.json + latest.json
 3. poll /history/{id} (bridge does this) ◄───────────┘
 4. read report JSON (overall_score,
-per-axis diffs, fix_suggestions)
-5. adjust Prompt-Builder knobs / prompt
+per-axis score + ref/gen values)
+5. steer prompt toward ref on worst axes
 └──► go to 1 until overall_score ≥ target
 ```
 
@@ -60,23 +60,22 @@ Stdout (captured by the agent) is the report:
   "run_tag": "iter003",
   "overall_score": 0.62,
   "axes": {
-    "pose": {"score": 0.40, "diff": "ref standing, gen seated"},
-    "clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"}
+    "position": {"score": 0.40, "ref": "doggy style", "gen": "missionary"},
+    "clothing_state": {"score": 0.85, "ref": "red lace lingerie", "gen": "plain bra"}
   },
-  "fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"],
-  "prompt_used": "1 woman, red lingerie, ...",
+  "prompt_used": "...",
   "_prompt_id": "…", "_report_path": "…/calib_iter003.json"
 }
 ```
 
 ## Agent calibration policy (suggested)
 
-The agent maps the lowest-scoring axes onto Prompt-Builder knobs and applies the
-`fix_suggestions`, regenerates, and keeps changes that raise `overall_score`
-(greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so
-the score reflects the prompt, not sampler noise; vary the seed only once you're near the
-target. Stop at `overall_score ≥ target` (e.g. 0.85) or a max-iteration budget. Log every
-`(prompt, knobs, score)` so the search is auditable/resumable.
+For the lowest-scoring axes, the agent rewrites that axis's prompt wording to match its
+`ref` value (the target), regenerates, and keeps changes that raise `overall_score`
+(greedy per-axis hill-climb). The local model supplies no fixes — the agent owns the
+correction. Keep the **T2I seed fixed** while searching so the score reflects the prompt,
+not sampler noise; vary the seed only once near target. Stop at `overall_score ≥ target`
+(e.g. 0.85) or a max-iteration budget. Full policy: **[CALIBRATION_POLICY.md](CALIBRATION_POLICY.md)**.
 
 ## Setup checklist
 
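The report shape in the hunk above is easy to consume on the agent side. The following is a minimal sketch, assuming the report is the JSON object captured from stdout; `worst_axes` and the sample values are hypothetical and are not part of this commit.

```python
import json


def worst_axes(report: dict, n: int = 3) -> list[tuple[str, dict]]:
    """Return the n lowest-scoring axes as (name, {score, ref, gen}) pairs."""
    axes = report.get("axes", {})
    # Lowest score first: those axes have the largest gap to the reference.
    ranked = sorted(axes.items(), key=lambda kv: float(kv[1].get("score", 0.0)))
    return ranked[:n]


# Sample report in the documented shape (values are illustrative only).
sample = json.loads("""
{
  "run_tag": "iter003",
  "overall_score": 0.62,
  "axes": {
    "scene": {"score": 0.30, "ref": "dim bedroom", "gen": "bright kitchen"},
    "framing": {"score": 0.60, "ref": "full body", "gen": "close-up"},
    "lighting_color": {"score": 0.90, "ref": "warm low-key", "gen": "warm low-key"}
  }
}
""")

for name, info in worst_axes(sample, n=2):
    print(f"{name}: {info['score']:.2f} ref:[{info['ref']}] gen:[{info['gen']}]")
```

Sorting ascending by `score` mirrors the worst-first ordering that the commit introduces for `diff_analysis`, so the agent and the human-readable summary prioritise the same axes.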
+90
-99
@@ -1,135 +1,126 @@
 # Calibration policy — the agent's playbook
 
-This is the instruction set the **external CLI agent** (the controller) follows each
-iteration. Paste the "Agent system prompt" block into your agent, give it the workflow
-path + reference image + target score, and let it loop.
+The local Qwen3-VL judge only **observes and scores** — it does not propose fixes. The
+**external agent** (you / a stronger model) decides every correction. So the judge's job
+is to hand the agent the *range of information needed to calibrate*, and the agent's job
+is to turn that into prompt edits.
 
-The agent calibrates by reasoning over the **Prompt‑Builder axes** and editing a
-structured *axis state*, then **rendering that state to a prompt string** that it injects
-into the `CalibratorPromptReceptor`. This keeps the reasoning axis‑aware while staying
-compatible with the flat‑string receptor. (If you later switch the receptor to carry a
-structured config, the same axis state maps straight onto Prompt‑Builder's split control
-nodes.)
+## What the agent needs from each comparison (the information model)
 
----
+To move a generated image toward a reference, for **every dimension the prompt controls**
+the agent needs three things:
 
-## Axis state (the agent's working memory)
+| field | meaning | why the agent needs it |
+|---|---|---|
+| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
+| `gen` | what the **generated** image shows | the **current** state — what to change |
+| `score` | 0–1 closeness | the **gap / priority** — which axes to fix first |
 
-```json
-{
-  "cast": "1 woman, mid-20s, athletic",
-  "clothing": "red lace lingerie",
-  "pose": "standing, hand on hip",
-  "scene": "dimly lit bedroom",
-  "composition": "full-body shot, slight low angle",
-  "expression": "soft smile, eye contact",
-  "color_light": "warm rim light, shallow depth of field",
-  "quality": "photorealistic, high detail",
-  "negative": "blurry, deformed, lowres, extra limbs",
-  "seed": 12345
-}
-```
+That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
+prompt so `gen → ref` on the lowest-scoring axes. The judge returns exactly this per axis
+(`{"score", "ref", "gen"}`) plus a top-level `overall_score`.
 
-These keys are exactly the Judge's scoring axes. `quality`/`negative`/`seed` are carried
-but not scored. Render order (subject → wardrobe → action → setting → framing → affect →
-light → quality):
+The axes must **span what the prompt can express** — you can only fix what the prompt can
+say, and each diff must map to a lever. The default set (configurable on the node) is
+grouped below.
 
-```
-prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
-```
+## Axes (default set — edit `axes` on the node to taste)
 
----
+- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
+- **Body:** `body_type`, `distinctive_features` (tattoos/piercings/marks), `hair`
+- **Wardrobe:** `clothing_state` (degree of undress + garments)
+- **Action (where explicit content concentrates):** `sexual_act`, `position`,
+  `penetration`, `explicitness`, `body_contact`
+- **Affect:** `pose`, `facial_expression`, `gaze`
+- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
+- **Render:** `scene`, `lighting_color`, `art_style`
 
-## Per‑iteration algorithm (greedy per‑axis hill‑climb)
+Coarse axes blur the differences that matter for adult imagery; this set keeps the act /
+interaction cluster granular so the agent gets actionable targets.
+
+## Per-iteration algorithm (greedy per-axis hill-climb)
 
 ```
 best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
 loop:
   i += 1
-  prompt = render(state)
+  prompt = render(state) # state = current value per axis
   report = run agent_bridge.py --prompt prompt --negative state.negative
                --seed state.seed --run-tag iter{i}
                --workflow wf.json --analysis-dir <report_dir>
-  score = report.overall_score
-  if score >= TARGET: # e.g. 0.85
-      stop("converged", state, score)
-  if score > best_score:
-      best_score = score ; best_state = state ; stale = 0
+  if report.overall_score >= TARGET: stop("converged", state) # e.g. 0.85
+  if report.overall_score > best_score:
+      best_score = report.overall_score ; best_state = state ; stale = 0
   else:
-      stale += 1
-      state = best_state # revert: undo the change that didn't help
-  if stale >= PATIENCE or i >= MAX_ITERS: # e.g. PATIENCE=4, MAX_ITERS=25
-      stop("plateau/budget", best_state, best_score)
+      stale += 1 ; state = best_state # revert the change that didn't help
+  if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
 
-  # choose the next single edit:
-  worst_axis = axis with lowest per-axis score in report.axes
-  edit = map_fix_to_axis(report.fix_suggestions, worst_axis) # apply the model's suggestion
-  state = apply(best_state, worst_axis, edit) # change ONE axis only
+  worst = axis with the lowest report.axes[*].score
+  target_value = report.axes[worst].ref # what the reference shows
+  state = apply(best_state, worst, edit_toward(target_value)) # change ONE axis
 ```
 
+`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
+wording for that axis (e.g. `gen:[missionary] → ref:[doggy style]` ⇒ set the position
+phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
+
 ### Rules that matter
 
-1. **Change one axis per iteration.** One edit = clean attribution of the score delta.
-   Only batch two edits when two axes score very low *and* are clearly independent.
-2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not
-   sampler noise. Vary the seed only after you've converged, to confirm robustness.
-3. **Always edit from `best_state`, not the last (possibly worse) state** — that's the
-   "revert on no improvement" step. Prevents drifting down a bad path.
-4. **Target the lowest‑scoring axis first**, applying the Judge's matching
-   `fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value
-   for that axis before moving on.
-5. **Near the margin, don't over‑trust one reading.** `swap_eval` already averages two
-   orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare
-   averages before committing.
-6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size
-   (smaller, more specific wording changes) and re‑anchor on `best_state`.
-7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per‑axis)`.
-   The run must be auditable and resumable.
-
-### Mapping `fix_suggestions` → axes
-
-The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to
-clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous,
-attribute it to the lowest‑scoring axis it plausibly affects.
-
----
+1. **Change one axis per iteration** — clean attribution of the score delta. Batch two
+   only when both are very low and clearly independent.
+2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
+   noise. Vary the seed only after converging, to confirm robustness.
+3. **Always edit from `best_state`**, never from a worse last state.
+4. **Steer toward `ref`** on the worst axis; if the obvious wording doesn't move the score
+   after a try, try an alternative phrasing for that axis before moving on.
+5. **Near the margin, don't over-trust one reading.** `swap_eval` already averages two
+   orderings; if two candidates are within ~0.03, re-run each on a second seed.
+6. **Log every step**: `(iter, axis_changed, old→new, overall_score, worst-axes)`.
 
 ## Worked example
 
 ```
-iter1 prompt="1 woman, casual outfit, indoors, ..." score=0.41
-      axes: scene 0.30 (worst) — "ref bedroom, gen kitchen"
-      fix: "set scene to a dim bedroom"
-iter2 edit scene→"dimly lit bedroom" score=0.58 (kept)
-      axes: pose 0.35 (worst) — "ref standing, gen seated"
-iter3 edit pose→"standing, hand on hip" score=0.71 (kept)
-      axes: color_light 0.50 (worst) — "ref warm, gen flat"
-iter4 edit color_light→"warm rim light" score=0.69 (worse → revert)
-iter5 edit color_light→"warm golden hour glow" score=0.83 (kept)
-      axes: clothing 0.78 (worst) — "gen lacks lace detail"
-iter6 edit clothing→"red lace lingerie with trim" score=0.88 ≥ target → STOP
+iter1 overall=0.41 worst: scene 0.30 ref:[dim bedroom] gen:[bright kitchen]
+      edit scene → "dimly lit bedroom"
+iter2 overall=0.58 worst: position 0.35 ref:[doggy style] gen:[missionary]
+      edit position → "doggy style"
+iter3 overall=0.71 worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
+      edit lighting → "warm low-key lighting" (0.69 → revert)
+iter4 overall=0.69 retry lighting → "warm golden low-key glow" (0.84 → keep)
+iter5 overall=0.84 worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
+      edit clothing → "red lace lingerie"
+iter6 overall=0.89 ≥ target → STOP
 ```
 
----
+## Report shape the agent reads (`latest.json` / stdout)
 
+```json
+{
+  "run_tag": "iter002",
+  "overall_score": 0.58,
+  "axes": {
+    "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
+    "scene": {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
+  },
+  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
+}
+```
+
 ## Agent system prompt (paste into your CLI agent)
 
 > You are the controller for a local image prompt calibrator. Goal: make a generated
-> image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes
-> (cast, clothing, pose, scene, composition, expression, color_light) from 0–1.
+> image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity,
+> body, wardrobe, action, affect, camera, render) and for each returns `score` (0–1
+> closeness), `ref` (what the reference shows) and `gen` (what the generated shows).
 >
-> You hold an **axis state** (JSON, keys above). Each turn you: (1) render the state to a
-> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→
-> quality; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
-> --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir
-> <report_dir>`; (3) read the printed JSON report.
+> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
+> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
+> --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>`;
+> (3) read the printed JSON.
 >
-> Then apply greedy per‑axis hill‑climb: keep the change only if `overall_score` improved,
-> else revert to the best state; pick the **lowest‑scoring axis** and apply the Judge's
-> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching.
-> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non‑improving
-> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
->
-> Never change more than one axis at a time unless two axes are both very low and clearly
-> independent. Never trust a single near‑margin reading — re‑run on a second seed when two
-> candidates are within 0.03.
+> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
+> revert to the best state; pick the **lowest-scoring axis** and rewrite that axis's prompt
+> wording to match its `ref` value (you decide the wording — there are no machine-supplied
+> fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop at
+> `overall_score ≥ TARGET` (default 0.85), PATIENCE=4 non-improving turns, or MAX_ITERS=25.
+> Log every step and report the best prompt + score.
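The per-iteration pseudocode in the policy above can be exercised offline. The following is a rough sketch under stated assumptions: `hill_climb`, `fake_run` and `REFERENCE` are hypothetical names, `fake_run` stands in for the real `agent_bridge.py` call plus the judge, and the `edit_toward` step is reduced to copying the `ref` text verbatim, which a real agent would replace with its own rewording.

```python
from typing import Callable

State = dict[str, str]


def hill_climb(initial: State, run_once: Callable[[State], dict],
               target: float = 0.85, patience: int = 4, max_iters: int = 25):
    """Greedy per-axis hill-climb; returns (status, state, score)."""
    best_state, best_score = dict(initial), -1.0
    state, stale = dict(initial), 0
    for _ in range(max_iters):
        report = run_once(state)
        score = float(report["overall_score"])
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_state, best_score, stale = dict(state), score, 0
        else:
            stale += 1  # the last edit did not help; it is dropped below
            if stale >= patience:
                return "plateau", best_state, best_score
        axes = report.get("axes", {})
        if not axes:
            return "no_axes", best_state, best_score
        worst = min(axes, key=lambda a: float(axes[a].get("score", 0.0)))
        # Always edit from best_state and change ONE axis only.
        # Naive stand-in for edit_toward(): copy the ref text verbatim.
        state = dict(best_state)
        state[worst] = str(axes[worst].get("ref", ""))
    return "budget", best_state, best_score


# Hypothetical stand-in for the bridge + judge: an axis scores 1.0 when the
# state text equals the reference text, otherwise 0.2.
REFERENCE = {"scene": "dim bedroom", "framing": "full body",
             "lighting_color": "warm low-key"}


def fake_run(state: State) -> dict:
    axes = {a: {"score": 1.0 if state.get(a) == ref else 0.2,
                "ref": ref, "gen": state.get(a, "")}
            for a, ref in REFERENCE.items()}
    overall = sum(v["score"] for v in axes.values()) / len(axes)
    return {"overall_score": overall, "axes": axes}


start = {"scene": "bright kitchen", "framing": "close-up",
         "lighting_color": "flat daylight"}
status, final_state, final_score = hill_climb(start, fake_run)
print(status, final_score)
```

One limitation of this sketch: after a non-improving turn it picks the worst axis from the rejected report, so a verbatim copy of `ref` would simply repeat the same edit. That is where rule 4 applies and the agent should try an alternative phrasing.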
+25
-25
@@ -28,12 +28,12 @@
 │ Qwen3-VL JUDGE node ── the "vllm node" │
 │ in : reference + generated │
 │ out: overall_score 0..1 │
-│ per-axis scores (cast, clothing, pose, scene, │
-│ composition, expression, color/lighting) │
-│ diff_analysis (JSON: what's off + how to fix, │
-│ phrased in Prompt-Builder axis vocabulary) │
+│ per-axis {score, ref, gen} over ~20 axes │
+│ (identity, body, wardrobe, action, affect, │
+│ camera, render) — target vs current values │
+│ (local model observes only; no fixes suggested) │
 └────────────────────┬──────────────────────────────────┘
-│ score + diffs
+│ score + ref/gen per axis
 ┌────────────────────▼────────────────┐
 │ CALIBRATOR / controller │
 │ - accumulate per-axis scores │
@@ -111,30 +111,30 @@ is sequential anyway. The 8B bf16 judge co‑resides more easily.
 
 ## 3. Scoring rubric (what the VLM actually returns)
 
-The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and a score
-per axis, where the axes mirror what Prompt‑Builder can control. This is what makes the
-diff *actionable* instead of generic prose.
+The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and, per axis,
+the **target value (`ref`), the current value (`gen`), and the gap (`score`)** — exactly
+the *target / current / distance* an agent needs to calibrate. The local model only
+observes; it suggests no fixes (a stronger external model owns correction).
 
 ```json
 {
   "overall_score": 0.0,
   "axes": {
-    "cast": {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
-    "clothing": {"score": 0.0, "diff": "ref lingerie vs gen nude"},
-    "pose": {"score": 0.0, "diff": "ref standing vs gen seated"},
-    "scene": {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
-    "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
-    "expression": {"score": 0.0, "diff": "ref smiling vs gen neutral"},
-    "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
-  },
-  "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
+    "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
+    "position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
+    "clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
+    "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
+    "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
+    "lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
+  }
 }
 ```
 
-The axis list is **configurable** on the node so it can match whichever Prompt‑Builder
-knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
-color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
-map each one onto a knob.
+The axis list is **configurable** on the node. The default ~20 axes are grouped as
+identity / body / wardrobe / action / affect / camera / render, kept granular so the
+*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
+stays discriminative for explicit content. The agent steers each low axis's prompt wording
+toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
 
 ### Reducing VLM‑as‑judge variance (important)
 
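Because the rubric asks for strict JSON, a consumer may want to check a reply before trusting it. The following is a minimal sketch of such a check, assuming only the `{overall_score, axes: {name: {score, ref, gen}}}` shape described in this section; `validate_reply` and `_is_unit_score` are hypothetical helpers, not functions from the repository.

```python
import json


def _is_unit_score(value) -> bool:
    """True for a real number in [0, 1]; bool is rejected explicitly."""
    return (isinstance(value, (int, float)) and not isinstance(value, bool)
            and 0.0 <= value <= 1.0)


def validate_reply(text: str, axes: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the reply looks usable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    if not _is_unit_score(data.get("overall_score")):
        problems.append("overall_score missing or outside 0..1")
    reported = data.get("axes")
    if not isinstance(reported, dict):
        return problems + ["axes missing or not an object"]
    for axis in axes:
        entry = reported.get(axis)
        if not isinstance(entry, dict):
            problems.append(f"{axis}: missing")
            continue
        if not _is_unit_score(entry.get("score")):
            problems.append(f"{axis}: score missing or outside 0..1")
        for field in ("ref", "gen"):
            if not isinstance(entry.get(field), str):
                problems.append(f"{axis}: {field} is not a string")
    return problems


good = ('{"overall_score": 0.5, "axes": {"scene": '
        '{"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"}}}')
bad = ('{"overall_score": 1.4, "axes": {"scene": '
       '{"score": 0.5, "ref": "dim bedroom"}}}')
print(validate_reply(good, ["scene"]))
print(validate_reply(bad, ["scene", "framing"]))
```

Returning a list of problems rather than raising lets a caller decide whether to retry the judge, skip the axis, or fall back to the other `swap_eval` pass.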
@@ -162,10 +162,10 @@ LLM). So "calibration" = **searching the space of `(seed, profile, per‑axis ov
 to maximize `overall_score`. Three controller options, easiest → strongest:
 
 1. **Greedy per‑axis hill‑climb (start here).**
-   For each axis with the lowest score, apply the matching `fix_suggestion` as a knob
-   override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change
-   if `overall_score` improved, else revert. Loop until ≥ target or no axis improves.
-   Implementable today with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.
+   Take the lowest‑scoring axis, rewrite that axis's prompt wording toward its `ref`
+   (target) value, regenerate, keep the change if `overall_score` improved, else revert.
+   Loop until ≥ target or no axis improves. The agent decides the wording (no machine
+   fixes). Implementable with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.
 
 2. **Black‑box optimizer over the knob vector.**
    Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES /
+58
-24
@@ -41,7 +41,36 @@ RECOMMENDED_MODELS = {
     "4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated",
 }
 
-DEFAULT_AXES = "cast, clothing, pose, scene, composition, expression, color_light"
+# Difference axes the judge scores. Granular by default so the comparison is
+# discriminative for explicit/adult imagery (where coarse axes blur the differences
+# that matter). Fully configurable on the node — trim or extend per use case.
+#   subject_count         number of people
+#   gender_mix            gender composition (e.g. 1F, 2F1M)
+#   body_type             physique / build / proportions per subject
+#   distinctive_features  tattoos / piercings / marks (identity anchors)
+#   age_appearance        apparent age
+#   ethnicity_skin        ethnicity / skin tone
+#   hair                  length, color, style
+#   clothing_state        degree of undress + specific garments
+#   sexual_act            the act / activity being performed
+#   position              sexual position / arrangement of bodies
+#   penetration           type & visibility of penetration
+#   explicitness          how graphic / genital visibility level
+#   body_contact          who contacts whom; interaction between subjects
+#   pose                  non-act body positioning
+#   facial_expression     face / affect
+#   gaze                  eye contact / look direction
+#   framing               shot type / crop (close-up <-> full body)
+#   camera_angle          POV / angle / perspective
+#   scene                 location / setting / background
+#   lighting_color        palette, lighting, color grade
+#   art_style             photoreal vs anime/illustrated, render style
+DEFAULT_AXES = (
+    "subject_count, gender_mix, body_type, distinctive_features, age_appearance, "
+    "ethnicity_skin, hair, clothing_state, sexual_act, position, penetration, "
+    "explicitness, body_contact, pose, facial_expression, gaze, framing, "
+    "camera_angle, scene, lighting_color, art_style"
+)
 
 # Cache loaded (model, processor) keyed by (path, precision) so the loop does not
 # reload weights every iteration.
@@ -196,27 +225,31 @@ def _ensure_chat_template(processor, model_path: str):
 
 
 def _build_system_prompt(axes: list[str]) -> str:
-    axis_lines = "\n".join(f' "{a}": {{"score": <0..1>, "diff": "<short note>"}},' for a in axes)
+    axis_lines = "\n".join(
+        f' "{a}": {{"score": <0..1>, "ref": "<what IMAGE 1 shows>", "gen": "<what IMAGE 2 shows>"}},'
+        for a in axes)
     return (
         "You are a meticulous visual-similarity judge for an image-generation "
         "calibration loop. You are shown two images: IMAGE 1 is the REFERENCE "
         "(the target) and IMAGE 2 is the GENERATED candidate. Judge how closely "
         "the GENERATED image reproduces the REFERENCE.\n\n"
-        "Score each axis from 0 to 1 using this anchored rubric:\n"
-        " 0.0 = unrelated; 0.5 = same general category but clearly different "
-        "details; 1.0 = near-identical.\n"
-        "For each axis, FIRST note the concrete difference, THEN assign the number.\n\n"
+        "For every axis report THREE things:\n"
+        " - ref: concretely what IMAGE 1 (reference / target) shows for this axis\n"
+        " - gen: concretely what IMAGE 2 (generated) shows for this axis\n"
+        " - score: 0..1 closeness, where 0.0 = unrelated, 0.5 = same general "
+        "category but clearly different details, 1.0 = near-identical.\n"
+        "Use specific concrete values (e.g. ref 'doggy style', gen 'missionary'), "
+        "not vague notes. Describe ONLY what you observe — do NOT suggest fixes or "
+        "prompt changes; correction is handled by a separate model.\n\n"
         "Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n"
         "{\n"
         ' "overall_score": <0..1>,\n'
         ' "axes": {\n'
         f"{axis_lines}\n"
-        " },\n"
-        ' "fix_suggestions": ["<actionable change to the generation prompt>", ...]\n'
+        " }\n"
         "}\n"
-        "Phrase every diff and fix in terms of the named axes "
-        "(cast/clothing/pose/scene/composition/expression/color_light). "
-        "overall_score must be consistent with the per-axis scores."
+        "overall_score must be consistent with the per-axis scores. If an axis is "
+        "not applicable to either image, set score 1.0 and ref/gen to \"n/a\"."
     )
 
 
@@ -311,7 +344,7 @@ def _merge_swapped(a: dict, b: dict) -> dict:
         return a
     if not a:
         return b
-    out = {"axes": {}, "fix_suggestions": []}
+    out = {"axes": {}}
     out["overall_score"] = round(
         (float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4
     )
@@ -320,9 +353,11 @@ def _merge_swapped(a: dict, b: dict) -> dict:
         sa = a.get("axes", {}).get(ax, {})
         sb = b.get("axes", {}).get(ax, {})
         score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0
-        diff = sa.get("diff") or sb.get("diff") or ""
-        out["axes"][ax] = {"score": round(score, 4), "diff": diff}
-    out["fix_suggestions"] = (a.get("fix_suggestions") or []) + (b.get("fix_suggestions") or [])
+        # In pass b the images were swapped, so b.ref describes the generated image
+        # and b.gen the reference -> invert b when falling back.
+        ref = sa.get("ref") or sb.get("gen") or ""
+        gen = sa.get("gen") or sb.get("ref") or ""
+        out["axes"][ax] = {"score": round(score, 4), "ref": ref, "gen": gen}
     return out
 
 
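The inversion in the hunk above is easiest to see on a single axis. The following is a small standalone restatement of the per-axis merge lines from the diff; the function name `merge_axis` and the sample values are hypothetical, chosen only to show the fallback path when pass a returned no text.

```python
def merge_axis(sa: dict, sb: dict) -> dict:
    """Merge one axis from the normal pass (sa) and the swapped pass (sb)."""
    score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0
    # In the swapped pass the judge saw the generated image as IMAGE 1,
    # so sb["ref"] describes the generated image and sb["gen"] the reference.
    ref = sa.get("ref") or sb.get("gen") or ""
    gen = sa.get("gen") or sb.get("ref") or ""
    return {"score": round(score, 4), "ref": ref, "gen": gen}


# Pass a produced a score but no descriptions; pass b (swapped) has both.
normal = {"score": 0.4}
swapped = {"score": 0.6, "ref": "bright kitchen", "gen": "dim bedroom"}
print(merge_axis(normal, swapped))

# When pass a has descriptions, they win and pass b only affects the score.
full = {"score": 0.4, "ref": "dim bedroom", "gen": "bright kitchen"}
print(merge_axis(full, swapped))
```

Without the inversion, the fallback would report the generated image's description as the target, and the agent would steer the prompt toward what it already produces.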
@@ -352,7 +387,6 @@ def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all,
         "run_tag": run_tag,
         "overall_score": round(float(overall), 4),
         "axes": (merged or {}).get("axes", {}),
-        "fix_suggestions": (merged or {}).get("fix_suggestions", []),
         "diff_analysis": diff_analysis,
         "prompt_used": prompt_used,
         "raw": raw_all,
@@ -395,7 +429,7 @@ class QwenVLImageJudge:
                 "model_path": ("STRING", {"default": DEFAULT_MODEL_PATH}),
                 "precision": (["bf16", "fp16", "fp8", "nf4"], {"default": "bf16"}),
                 "axes": ("STRING", {"default": DEFAULT_AXES, "multiline": True}),
-                "max_new_tokens": ("INT", {"default": 512, "min": 64, "max": 4096}),
+                "max_new_tokens": ("INT", {"default": 1024, "min": 64, "max": 4096}),
                 "temperature": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.5, "step": 0.05}),
                 "swap_eval": ("BOOLEAN", {"default": True}),
             },
@@ -448,13 +482,13 @@ class QwenVLImageJudge:
         overall = float(merged.get("overall_score", 0.0)) if merged else 0.0
         axis_scores = json.dumps(merged.get("axes", {}), ensure_ascii=False, indent=2) if merged else "{}"
 
-        # Human/controller-readable diff summary.
-        diff_lines = []
-        for ax, info in (merged.get("axes", {}) if merged else {}).items():
-            diff_lines.append(f"- {ax}: {info.get('score', 0):.2f} — {info.get('diff', '')}")
-        fixes = merged.get("fix_suggestions", []) if merged else []
-        if fixes:
-            diff_lines.append("fixes: " + "; ".join(str(f) for f in fixes))
+        # Human/controller-readable diff summary, worst axes first (biggest gap).
+        items = sorted((merged.get("axes", {}) if merged else {}).items(),
+                       key=lambda kv: float(kv[1].get("score", 0)))
+        diff_lines = [
+            f"- {ax}: {info.get('score', 0):.2f} ref:[{info.get('ref', '')}] gen:[{info.get('gen', '')}]"
+            for ax, info in items
+        ]
         diff_analysis = "\n".join(diff_lines) if diff_lines else "(no parseable judgement)"
 
         report_path = _write_report(