e4dfaac63b
The 4B over-uses 'partial' (mislabels identical ref/gen and clear opposites) and also mis-identifies fine-grained content (e.g. names a position 'doggy'/'cowgirl' when it is neither). Deterministic fix: force verdict=match when normalized ref==gen. Prompt hardened to not default to 'partial' (opposites=mismatch). Docs: the 4B is only reliable for coarse attributes — use the 30B for fine-grained recognition; prefer grounded geometry axes over named-position labels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Calibration policy: the agent's playbook

The local Qwen3-VL judge only **observes and scores**; it does not propose fixes. The
**external agent** (you / a stronger model) decides every correction. So the judge's job
is to hand the agent the *range of information needed to calibrate*, and the agent's job
is to turn that into prompt edits.

## What the agent needs from each comparison (the information model)

To move a generated image toward a reference, for **every dimension the prompt controls**
the agent needs three things:

| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target**: what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state: what to change |
| `verdict` | `match` / `partial` / `mismatch` | which axes to fix first (mismatch → partial → match) |

That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
prompt so `gen → ref` on the axes that differ.

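The three fields and the fix order can be sketched as a small data structure. This is an illustrative sketch only: `AxisResult`, `VERDICT_PRIORITY`, and `axes_to_fix` are hypothetical names, not part of the node or of `agent_bridge.py`, and the sample values are placeholders.

```python
from dataclasses import dataclass

# Fix order from the table: mismatch first, then partial, then match.
VERDICT_PRIORITY = {"mismatch": 0, "partial": 1, "match": 2}


@dataclass(frozen=True)
class AxisResult:
    ref: str      # target: what the reference shows
    gen: str      # current: what the generated image shows
    verdict: str  # distance: "match" | "partial" | "mismatch"


def axes_to_fix(axes: dict[str, AxisResult]) -> list[str]:
    """Return the axis names that still differ, most urgent first."""
    pending = [name for name, result in axes.items() if result.verdict != "match"]
    return sorted(pending, key=lambda name: VERDICT_PRIORITY[axes[name].verdict])


report_axes = {
    "scene": AxisResult("dim bedroom", "dim bedroom", "match"),
    "hair": AxisResult("curly shoulder-length", "straight long", "partial"),
    "lighting_color": AxisResult("warm low-key", "flat daylight", "mismatch"),
}
print(axes_to_fix(report_axes))  # ['lighting_color', 'hair']
```

Matching axes drop out of the list entirely, so the agent only ever sees work that remains.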
**Model capability is the critical path.** Garbage descriptions in → garbage calibration
out. The **4B is too weak for fine-grained NSFW recognition**: it mislabels the verdict
(a central-tendency bias toward `partial`, even when `ref` and `gen` are identical or clear
opposites) AND mis-identifies content: it will confidently call a position "doggy" or
"cowgirl" when it is neither. It is only reliable for *coarse* attributes (subject count,
nude/clothed, photoreal vs anime, broad scene). For anything fine-grained (named positions,
limb arrangement, gaze, hair detail) **use the 30B** (`model_path=30b-a3b`,
`precision=nf4`). The node corrects the trivially wrong verdicts (identical `ref`==`gen` →
`match`), but it cannot fix a wrong *description*; only a more capable model can.

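The deterministic verdict correction mentioned above might look roughly like the sketch below. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are an assumption, and `normalize` and `correct_verdict` are hypothetical names; the node's actual implementation may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def correct_verdict(ref: str, gen: str, verdict: str) -> str:
    """Force 'match' when both descriptions normalize to the same non-empty string."""
    ref_norm = normalize(ref)
    if ref_norm and ref_norm == normalize(gen):
        return "match"
    return verdict  # a wrong description cannot be repaired at this stage


print(correct_verdict("Dim bedroom.", "dim  bedroom", "partial"))    # match
print(correct_verdict("dim bedroom", "bright kitchen", "partial"))  # partial
```

The second call shows the limit of this fix: when the descriptions differ, the judge's verdict passes through untouched, right or wrong.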
**Prefer grounded geometry over named labels.** A named position (`position_name`) forces
the model to classify into a vocabulary it gets wrong; observable geometry
(`body_orientation`, `limb_arrangement`, `contact_points`, who faces where) is more
grounded and survives a weaker model better. Weight those axes over the named label.

The axes must **span what the prompt can express**: you can only fix what the prompt can
say, and each diff must map to a lever. The default set (configurable on the node) is
grouped below.

## Axes (default set; edit `axes` on the node to taste)

- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action / pose (where explicit content concentrates, so kept granular):** `sexual_act`,
  `position_name` (doggy/cowgirl/…), `body_orientation` (on top/from behind/…),
  `limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle),
  `contact_points`, `genital_visibility`, `pose` (torso/head lean)
- **Affect:** `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`

Each axis carries a one-line definition in the prompt (so e.g. `gender_mix` is a *count*,
not a position). Coarse axes blur the differences that matter for adult imagery; the act /
pose cluster is split into many axes so the agent gets specific, actionable targets.

## Step 0: first pass (describe / bootstrap)

The very first iteration has no generated image yet, so the judge runs in **describe
mode**: it looks at the reference alone and emits **one canonical scene description**,
a coherent, internally consistent paragraph plus a per-axis target spec. That seeds
everything *and* becomes the fixed reference for the whole loop:

```bash
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
    --run-tag seed --analysis-dir <report_dir>
```

This writes `calib_seed.json` = `{"mode":"describe", "description":"…", "axes":{axis:value,…}, "canonical_reference":"…"}`

The agent takes `description` as the **initial prompt** and `axes` as the **initial
axis_state**. Crucially, the compare loop then **anchors on this canonical reference**
(via `--ref-desc-file`) instead of re-reading the reference image every iteration, so the
`ref` side never drifts or contradicts itself across passes; only the generated image is
re-described each turn.

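As a rough illustration of how an agent might turn the seed file into its starting point, the sketch below reads the keys listed above and returns the initial prompt and axis state. `load_seed` is a hypothetical helper, and the demo file contents are placeholders rather than real judge output.

```python
import json
import tempfile
from pathlib import Path


def load_seed(path: Path) -> tuple[str, dict[str, str]]:
    """Return (initial prompt, initial axis_state) from a describe-mode seed file."""
    seed = json.loads(path.read_text(encoding="utf-8"))
    if seed.get("mode") != "describe":
        raise ValueError(f"expected a describe-mode seed, got mode={seed.get('mode')!r}")
    return seed["description"], dict(seed["axes"])


# Demo with a throwaway file; the values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "calib_seed.json"
    seed_path.write_text(json.dumps({
        "mode": "describe",
        "description": "a dim bedroom lit by a warm lamp",
        "axes": {"scene": "dim bedroom", "lighting_color": "warm low-key"},
        "canonical_reference": "a dim bedroom lit by a warm lamp",
    }), encoding="utf-8")
    initial_prompt, axis_state = load_seed(seed_path)

print(initial_prompt)
print(axis_state)
```

Checking `mode` up front guards against accidentally seeding the loop from a compare-mode report.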
## Per-iteration algorithm (greedy per-axis hill-climb)

```
best = -1 ; state = initial_state ; best_state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)                      # state = current value per axis
    report = run agent_bridge.py --prompt prompt --negative state.negative
                 --seed state.seed --run-tag iter{i}
                 --ref-desc-file <report_dir>/calib_seed.json   # anchor on canonical ref
                 --workflow wf.json --analysis-dir <report_dir>
    if report.mismatch_count == 0 and report.overall_score >= TARGET:
        stop("converged", state)                # TARGET e.g. 0.9 (mostly match)
    if report.overall_score > best:
        best = report.overall_score ; best_state = state ; stale = 0
    else:
        stale += 1 ; state = best_state         # revert the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)

    worst = a `mismatch` axis (else a `partial` axis) from report.axes
    target_value = report.axes[worst].ref       # what the reference shows
    state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
```

`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. `gen:[missionary] → ref:[doggy style]` ⇒ set the position
phrase to "doggy style"). There is no machine-supplied fix list; the agent owns this step.

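The loop can be exercised end to end without a GPU by swapping the judge for a stub. The sketch below is a minimal, assumption-laden version: `fake_judge` stands in for running `agent_bridge.py`, the `SCORE` weighting (match=1, partial=0.5, mismatch=0) is a guess at how `overall_score` might be computed, and `edit_toward` is reduced to copying the `ref` text verbatim.

```python
TARGET, PATIENCE, MAX_ITERS = 0.9, 4, 25
SCORE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}  # assumed weighting

REFERENCE = {
    "scene": "dim bedroom",
    "lighting_color": "warm low-key",
    "hair": "curly shoulder-length",
}


def fake_judge(state):
    """Stub judge: an axis matches only when its text equals the reference exactly."""
    axes = {}
    for axis, ref in REFERENCE.items():
        gen = state.get(axis, "")
        axes[axis] = {"ref": ref, "gen": gen,
                      "verdict": "match" if gen == ref else "mismatch"}
    scores = [SCORE[a["verdict"]] for a in axes.values()]
    return {
        "overall_score": sum(scores) / len(scores),
        "mismatch_count": sum(a["verdict"] == "mismatch" for a in axes.values()),
        "axes": axes,
    }


def pick_worst(axes):
    """Prefer a mismatch axis, else a partial axis, else None."""
    for wanted in ("mismatch", "partial"):
        for name, axis in axes.items():
            if axis["verdict"] == wanted:
                return name
    return None


def calibrate(initial_state, judge):
    state = dict(initial_state)
    best, best_state, stale = -1.0, dict(initial_state), 0
    log = []
    for i in range(1, MAX_ITERS + 1):
        report = judge(state)
        log.append((i, report["overall_score"], report["mismatch_count"]))
        if report["mismatch_count"] == 0 and report["overall_score"] >= TARGET:
            return "converged", state, log
        if report["overall_score"] > best:
            best, best_state, stale = report["overall_score"], dict(state), 0
        else:
            stale += 1  # the last change did not help; it is dropped below
        if stale >= PATIENCE:
            return "plateau", best_state, log
        worst = pick_worst(report["axes"])
        if worst is None:
            return "plateau", best_state, log
        state = dict(best_state)                     # always edit from the best state
        state[worst] = report["axes"][worst]["ref"]  # change ONE axis toward ref
    return "budget", best_state, log


start = {"scene": "bright kitchen", "lighting_color": "flat daylight", "hair": "straight long"}
status, final_state, log = calibrate(start, fake_judge)
print(status, len(log))  # converged 4
```

With a real judge the edit will not always land, which is where the revert branch and `PATIENCE` start to matter; the stub only demonstrates the control flow.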
### Rules that matter

1. **Change one axis per iteration** for clean attribution of the delta. Batch two only when
   both are `mismatch` and clearly independent.
2. **Freeze `seed` while searching**: the score must reflect the prompt, not sampler
   noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`**, never from a worse last state.
4. **Prioritize `mismatch` axes, then `partial`.** Steer toward `ref`; if the obvious
   wording doesn't flip the verdict, try an alternative phrasing before moving on.
5. **Trust the verdict + the ref/gen text, not fine score deltas.** The overall score is a
   coarse mean; use `mismatch_count` falling as the real progress signal.
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.

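One possible way to keep the step log from rule 6 is a flat CSV with one row per iteration. The column names below simply mirror the tuple in the rule; the CSV format and the in-memory buffer are assumptions for the sake of a runnable sketch.

```python
import csv
import io

FIELDS = ["iter", "axis_changed", "old", "new", "overall_score", "mismatch_count"]

buffer = io.StringIO()  # stand-in for an open log file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "iter": 1,
    "axis_changed": "scene",
    "old": "bright kitchen",
    "new": "dimly lit bedroom",
    "overall_score": 0.55,
    "mismatch_count": 6,
})

log_lines = buffer.getvalue().splitlines()
print(log_lines[1])  # 1,scene,bright kitchen,dimly lit bedroom,0.55,6
```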
## Worked example

```
iter1  overall=0.55 mism=6  worst: scene MISMATCH  ref:[dim bedroom] gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
iter2  overall=0.63 mism=5  worst: position_name MISMATCH  ref:[doggy style] gen:[cowgirl]
       edit position → "doggy style, from behind"
iter3  overall=0.71 mism=3  worst: lighting_color MISMATCH  ref:[warm low-key] gen:[flat daylight]
       edit lighting → "warm low-key lighting"   (mism=4 → revert)
iter4  retry lighting → "warm golden low-key glow"   (mism=2 → keep, overall=0.82)
iter5  overall=0.88 mism=1  worst: hair MISMATCH  ref:[curly shoulder-length] gen:[straight long]
       edit hair → "curly shoulder-length brown hair"
iter6  overall=0.93 mism=0  ≥ target → STOP
```

## Report shape the agent reads (`latest.json` / stdout)

```json
{
  "run_tag": "iter002",
  "overall_score": 0.63,
  "mismatch_count": 5,
  "axes": {
    "position_name": {"verdict": "mismatch", "ref": "doggy style", "gen": "cowgirl"},
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
```

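Before acting on a report, an agent may want to check that it has the expected shape. The sketch below validates the keys and verdict vocabulary shown above and extracts the axes that still differ; `parse_report` is a hypothetical helper and the sample values are placeholders.

```python
import json

VALID_VERDICTS = {"match", "partial", "mismatch"}
REQUIRED_KEYS = ("run_tag", "overall_score", "mismatch_count", "axes")


def parse_report(text: str) -> dict:
    """Parse a judge report and reject missing keys or unknown verdicts."""
    report = json.loads(text)
    for key in REQUIRED_KEYS:
        if key not in report:
            raise KeyError(f"report is missing {key!r}")
    for name, axis in report["axes"].items():
        if axis.get("verdict") not in VALID_VERDICTS:
            raise ValueError(f"axis {name!r} has unknown verdict {axis.get('verdict')!r}")
    return report


sample = json.dumps({
    "run_tag": "iter002",
    "overall_score": 0.5,
    "mismatch_count": 1,
    "axes": {
        "lighting_color": {"verdict": "mismatch", "ref": "warm low-key", "gen": "flat daylight"},
        "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"},
    },
})
report = parse_report(sample)

# Map each differing axis to (current, target) so the edit direction is explicit.
diffs = {name: (axis["gen"], axis["ref"])
         for name, axis in report["axes"].items() if axis["verdict"] != "match"}
print(diffs)  # {'lighting_color': ('flat daylight', 'warm low-key')}
```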
## Agent system prompt (paste into your CLI agent)

> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference, measured by a Qwen3-VL judge that compares ~24 axes (identity,
> body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict`
> (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the
> generated image shows). `overall_score` and `mismatch_count` are computed from the verdicts.
>
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<neg>" --seed <seed> --run-tag iter<N>
> --ref-desc-file <report_dir>/calib_seed.json --analysis-dir <report_dir>`;
> (3) read the printed JSON.
>
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> revert to the best state; pick a **mismatch** axis (else a **partial** axis) and rewrite
> that axis's prompt wording to match its `ref` value (you decide the wording; there are
> no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching.
> Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after
> PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.
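If the agent drives the bridge from Python rather than a shell, building the argument list explicitly avoids quoting mistakes in long prompts. The sketch below only assembles the command using the flags that appear in this document; `build_command` is a hypothetical helper, the argument values are placeholders, and nothing is executed.

```python
import shlex


def build_command(workflow, prompt, negative, seed, run_tag, analysis_dir,
                  ref_desc_file=None):
    """Assemble the agent_bridge.py invocation as an argv list (not executed here)."""
    cmd = [
        "python", "agent_bridge.py",
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", run_tag,
        "--analysis-dir", analysis_dir,
    ]
    if ref_desc_file:
        cmd += ["--ref-desc-file", ref_desc_file]  # anchor on the canonical reference
    return cmd


cmd = build_command(
    workflow="wf.json",
    prompt="dimly lit bedroom, warm low-key lighting",
    negative="blurry",
    seed=42,
    run_tag="iter3",
    analysis_dir="reports",
    ref_desc_file="reports/calib_seed.json",
)
print(shlex.join(cmd))  # shell-safe rendering, useful for the step log
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` would then run it without going through a shell.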