ComfyUI-Prompt-Calibrator/docs/CALIBRATION_POLICY.md
Ethanfel d389d6daff Trim dead inputs: drop fp16 precision and prompt_used
fp16 offers nothing over bf16 for these models (removed from the quant dropdown;
loader still tolerant if passed). prompt_used was metadata-only — removed from the
node inputs, report payload/markdown, the bridge, and the example workflows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 10:03:06 +02:00

# Calibration policy — the agent's playbook
The local Qwen3-VL judge only **observes and scores** — it does not propose fixes. The
**external agent** (you / a stronger model) decides every correction. So the judge's job
is to hand the agent the *range of information needed to calibrate*, and the agent's job
is to turn that into prompt edits.
## What the agent needs from each comparison (the information model)
To move a generated image toward a reference, for **every dimension the prompt controls**
the agent needs three things:
| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target** — what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state — what to change |
| `verdict` | `match` / `partial` / `mismatch` | which axes to fix first (mismatch → partial → match) |
That's the whole signal: *target, current, distance*. The agent corrects by rewriting the
prompt so `gen → ref` on the axes that differ.
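
The *target, current, distance* model can be sketched in a few lines of Python. This is a minimal illustration, not the node's actual data structure: `AxisReading`, `VERDICT_RANK`, and `fix_order` are hypothetical names, and the sample values are borrowed from the worked example later in this document.

```python
from dataclasses import dataclass

# Lower rank = fix sooner, following "mismatch → partial → match".
VERDICT_RANK = {"mismatch": 0, "partial": 1, "match": 2}


@dataclass(frozen=True)
class AxisReading:
    """Hypothetical container for one axis of a comparison."""
    axis: str
    ref: str      # target: what the reference shows
    gen: str      # current: what the generated image shows
    verdict: str  # distance: match / partial / mismatch


def fix_order(readings):
    """Return the axes that still differ, most urgent first."""
    pending = [r for r in readings if r.verdict != "match"]
    # sorted() is stable, so axes with the same verdict keep their input order.
    return sorted(pending, key=lambda r: VERDICT_RANK[r.verdict])


readings = [
    AxisReading("scene", "dim bedroom", "dim bedroom", "match"),
    AxisReading("hair", "curly shoulder-length", "straight long", "partial"),
    AxisReading("body_orientation", "female on top, facing partner",
                "female on bottom", "mismatch"),
]
for r in fix_order(readings):
    print(f"{r.axis}: steer '{r.gen}' toward '{r.ref}' ({r.verdict})")
```

Matching axes drop out entirely, so the agent only ever sees the axes where a prompt edit can still help.
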
**Model capability is the critical path.** Garbage descriptions in → garbage calibration
out. The **4B is too weak for fine-grained NSFW recognition**: it mislabels the verdict
(central-tendency bias toward `partial`) AND mis-identifies content — it will confidently
call a position "doggy" or "cowgirl" when it is neither. It's only reliable for *coarse*
attributes (subject count, nude/clothed, photoreal vs anime, broad scene). For anything
fine-grained — named positions, limb arrangement, gaze, hair detail — **use the 30B**
(`model_path=30b-a3b`, `precision=nf4`). The node corrects the trivially-wrong verdicts
(identical `ref`==`gen` → `match`), but it cannot fix a wrong *description*; only a more
capable model can.
**Grounded geometry, not named labels.** Naming a position (`doggy`/`cowgirl`) is
unreliable *even at 30B* — the named-label axis was removed. The pose cluster is now purely
observable geometry (`body_orientation` incl. who faces where, `limb_arrangement`,
`contact_points`, `pose`); compose a named position yourself from those primitives if you
need one. Geometry survives the model far better than the abstraction.
The axes must **span what the prompt can express** — you can only fix what the prompt can
say, and each diff must map to a lever. The default set (configurable on the node) is
grouped below.
## Axes (default set — edit `axes` on the node to taste)
- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action / pose (granular, observable geometry — no named labels):** `sexual_act`,
`body_orientation` (who on top/bottom/side + which way each faces),
`limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle),
`contact_points`, `genital_visibility`, `pose` (torso/head lean, arch)
- **Affect:** `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`
Each axis carries a one-line definition in the prompt (so e.g. `gender_mix` is a *count*,
not a position). Coarse axes blur the differences that matter for adult imagery; the act /
pose cluster is split into many axes so the agent gets specific, actionable targets.
## Analysis profiles (pick the axis set for the act)
A discrete verdict collapses *magnitude*, and a generic axis can hide the very thing you're
calibrating: for a blowjob, `sexual_act` reads "oral" in both ref and gen → MATCH, even if
in the gen the head is 20 cm from the penis. So the judge has **analysis profiles** — act-
specialized axis sets whose act-critical axes are **distance/proximity-aware**:
| profile | adds (beyond the shared identity/body/render base) |
|---|---|
| `general` | sexual_act, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility, pose |
| `oral` | **mouth_genital_contact**, **mouth_genital_distance** (touching / <5cm / 10-20cm / >20cm), **oral_depth** (tip/half/throat), tongue, hand_on_shaft, gaze_up |
| `penetration` | insertion_depth (tip/shallow/half/hilt), insertion_angle, body_orientation, limb_arrangement, … |
| `handjob` | hand_on_shaft, grip_style, stroke_position (base/mid/tip), mouth_genital_contact, … |
| `solo` | self_touch_location, toy_use, insertion_depth, … |
Now "mouth on the tip" vs "head 20 cm away" is a concrete, scored difference
(`mouth_genital_distance: mismatch ref:[contact] gen:[far >20cm]`) — the magnitude lives
in the `ref`/`gen` text. Set `profile` on the node (or `agent_bridge.py --profile oral`),
or override entirely with a custom `axes` list. Profiles are easy to extend in
`PROFILES`/`AXIS_DEFS` in `nodes/qwen_judge.py`.
## Step 0 — first pass (describe / bootstrap)
The very first iteration has no generated image yet, so the judge runs in **describe
mode**: it looks at the reference alone and emits **one canonical scene description**,
i.e. a coherent, internally consistent paragraph plus a per-axis target spec. That seeds
everything *and* becomes the fixed reference for the whole loop:
```bash
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
    --run-tag seed --analysis-dir <report_dir>
```
`calib_seed.json` = `{"mode":"describe", "description":"…", "axes":{axis:value,…}, "canonical_reference":"…"}`
The agent takes `description` as the **initial prompt** and `axes` as the **initial
axis_state**. Crucially, the compare loop then **anchors on this canonical reference**
(via `--ref-desc-file`) instead of re-reading the reference image every iteration — so the
`ref` side never drifts or contradicts itself across passes; only the generated image is
re-described each turn.
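
As a rough sketch of how an agent might turn the seed into its starting point, the snippet below reads a payload with the `calib_seed.json` shape shown above. The helper name `seed_to_state` and the sample values are illustrative assumptions; only the key names come from this document.

```python
import json


def seed_to_state(seed):
    """Return (initial_prompt, initial_axis_state) from a describe-mode seed."""
    if seed.get("mode") != "describe":
        raise ValueError(f"expected a describe-mode seed, got {seed.get('mode')!r}")
    # Copy the axes so later edits to the state never mutate the loaded seed.
    return seed["description"], dict(seed["axes"])


# Illustrative payload; a real run would json.load() the calib_seed.json file.
example_seed = json.loads("""
{
  "mode": "describe",
  "description": "a couple in a dimly lit bedroom, warm low-key lighting",
  "axes": {"scene": "dim bedroom", "lighting_color": "warm low-key"},
  "canonical_reference": "a couple in a dimly lit bedroom, warm low-key lighting"
}
""")

initial_prompt, axis_state = seed_to_state(example_seed)
print(initial_prompt)
print(axis_state)
```
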
## Per-iteration algorithm (greedy per-axis hill-climb)
```
best = -1 ; state = best_state = initial_state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)                    # state = current value per axis
    report = run agent_bridge.py --prompt prompt --negative state.negative
                 --seed state.seed --run-tag iter{i}
                 --ref-desc-file <report_dir>/calib_seed.json   # anchor on canonical ref
                 --workflow wf.json --analysis-dir <report_dir>
    if report.mismatch_count == 0 and report.overall_score >= TARGET:
        stop("converged", state)              # TARGET e.g. 0.9 (mostly match)
    if report.overall_score > best:
        best = report.overall_score ; best_state = state ; stale = 0
    else:
        stale += 1 ; state = best_state       # revert the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)
    worst = a `mismatch` axis (else a `partial` axis) from report.axes
    target_value = report.axes[worst].ref     # what the reference shows
    state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
```
`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt
wording for that axis (e.g. `gen:[female on bottom] → ref:[female on top, facing partner]`
⇒ set the orientation phrase to "woman straddling on top, facing him"). There is no
machine-supplied fix list; the agent owns this step.
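
The loop above can be exercised without ComfyUI by injecting the judge as a callable. The following is a sketch under stated assumptions, not the project's controller: `calibrate`, `pick_axis`, and `toy_evaluate` are hypothetical names, the toy judge simply compares strings, and (unlike the pseudocode) the next axis is picked from the report of the best state so that a reverted attempt does not steer the next edit.

```python
def pick_axis(axes):
    """Prefer a mismatch axis, else a partial axis, else None."""
    for wanted in ("mismatch", "partial"):
        for name, entry in axes.items():
            if entry["verdict"] == wanted:
                return name
    return None


def calibrate(initial_state, evaluate, edit_toward,
              target=0.9, patience=4, max_iters=25):
    state = dict(initial_state)
    best_score, best_state, best_report = -1.0, dict(state), None
    stale = 0
    for i in range(1, max_iters + 1):
        report = evaluate(state, i)
        score = report["overall_score"]
        if report["mismatch_count"] == 0 and score >= target:
            return "converged", state, score
        if score > best_score:
            best_score, best_state, best_report = score, dict(state), report
            stale = 0
        else:
            stale += 1  # the last edit did not help; it is dropped below
        if stale >= patience:
            return "plateau", best_state, best_score
        axis = pick_axis(best_report["axes"])
        if axis is None:
            return "plateau", best_state, best_score
        # Always edit from best_state, and change exactly one axis.
        state = dict(best_state)
        state[axis] = edit_toward(axis, best_report["axes"][axis]["ref"], stale)
    return "budget", best_state, best_score


# Toy stand-in for the judge: an axis matches only if the strings are equal.
TARGET_AXES = {"scene": "dim bedroom", "lighting_color": "warm low-key",
               "hair": "curly shoulder-length"}


def toy_evaluate(state, i):
    axes = {}
    for name, ref in TARGET_AXES.items():
        gen = state.get(name, "")
        verdict = "match" if gen == ref else "mismatch"
        axes[name] = {"verdict": verdict, "ref": ref, "gen": gen}
    mism = sum(e["verdict"] == "mismatch" for e in axes.values())
    return {"run_tag": f"iter{i}", "overall_score": 1 - mism / len(axes),
            "mismatch_count": mism, "axes": axes}


start = {"scene": "bright kitchen", "lighting_color": "flat daylight",
         "hair": "straight long"}
status, final_state, score = calibrate(
    start, toy_evaluate, lambda axis, ref, attempt: ref)
print(status, score, final_state)
```

In a real run, `evaluate` would render the state to a prompt and call `agent_bridge.py`, and `edit_toward` would be the agent's own rewording; the `attempt` argument gives it a hook for trying an alternative phrasing after a failed edit.
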
### Rules that matter
1. **Change one axis per iteration** — clean attribution of the delta. Batch two only when
both are `mismatch` and clearly independent.
2. **Freeze `seed` while searching** — the score must reflect the prompt, not sampler
noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`**, never from a worse last state.
4. **Prioritize `mismatch` axes, then `partial`.** Steer toward `ref`; if the obvious
wording doesn't flip the verdict, try an alternative phrasing before moving on.
5. **Trust the verdict + the ref/gen text, not fine score deltas.** The overall score is a
coarse mean; use `mismatch_count` falling as the real progress signal.
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.
## Worked example
```
iter1  overall=0.55 mism=6   worst: scene MISMATCH  ref:[dim bedroom] gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
iter2  overall=0.63 mism=5   worst: body_orientation MISMATCH  ref:[female on top, facing partner] gen:[female on bottom]
       edit → "woman straddling on top, facing him"
iter3  overall=0.71 mism=3   worst: lighting_color MISMATCH  ref:[warm low-key] gen:[flat daylight]
       edit lighting → "warm low-key lighting"   (mism=4 → revert)
iter4  retry lighting → "warm golden low-key glow"   (mism=2 → keep, overall=0.82)
iter5  overall=0.88 mism=1   worst: hair PARTIAL  ref:[curly shoulder-length] gen:[straight long]
       edit hair → "curly shoulder-length brown hair"
iter6  overall=0.93 mism=0   ≥ target → STOP
```
## Report shape the agent reads (`latest.json` / stdout)
```json
{
  "run_tag": "iter002",
  "overall_score": 0.63,
  "mismatch_count": 5,
  "axes": {
    "body_orientation": {"verdict": "mismatch", "ref": "female on top, facing partner", "gen": "female on bottom"},
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "_prompt_id": "...", "_report_path": "..."
}
```
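
To obtain this report from a script rather than by hand, the bridge call can be assembled as an argument list and its stdout parsed. This is a hedged sketch: the flags are the ones used elsewhere in this document, but `build_bridge_command` and `run_bridge` are hypothetical helpers, and treating stdout as exactly one JSON document is an assumption about `agent_bridge.py`.

```python
import json
import shlex
import subprocess
import sys


def build_bridge_command(workflow, prompt, negative, seed, run_tag,
                         analysis_dir, ref_desc_file=None):
    """Assemble the agent_bridge.py call as an argv list (no shell quoting)."""
    cmd = [sys.executable, "agent_bridge.py",
           "--workflow", workflow,
           "--prompt", prompt,
           "--negative", negative,
           "--seed", str(seed),
           "--run-tag", run_tag,
           "--analysis-dir", analysis_dir]
    if ref_desc_file:
        # Anchor the comparison on the canonical reference from the describe pass.
        cmd += ["--ref-desc-file", ref_desc_file]
    return cmd


def run_bridge(cmd):
    """Run the bridge; assumes it prints a single JSON report on stdout."""
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


cmd = build_bridge_command("wf.json", "dimly lit bedroom", "", 1234, "iter2",
                           "reports", ref_desc_file="reports/calib_seed.json")
print(shlex.join(cmd))  # shell-safe rendering, handy for the step log
```

Passing a list to `subprocess.run` avoids shell quoting problems with prompts that contain quotes or commas.
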
## Agent system prompt (paste into your CLI agent)
> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference, measured by a Qwen3-VL judge that compares ~23 axes (identity,
> body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict`
> (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the
> generated shows). `overall_score` and `mismatch_count` are computed from the verdicts.
>
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<neg>" --seed <seed> --run-tag iter<N>
> --ref-desc-file <report_dir>/calib_seed.json --analysis-dir <report_dir>`;
> (3) read the printed JSON.
>
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> revert to the best state; pick a **mismatch** axis (else a **partial** axis) and rewrite
> that axis's prompt wording to match its `ref` value (you decide the wording — there are
> no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching.
> Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after
> PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.