# Calibration policy: the agent's playbook

The local Qwen3-VL judge only **observes and scores**; it does not propose fixes. The **external agent** (you / a stronger model) decides every correction. So the judge's job is to hand the agent the *range of information needed to calibrate*, and the agent's job is to turn that into prompt edits.

## What the agent needs from each comparison (the information model)

To move a generated image toward a reference, the agent needs three things for **every dimension the prompt controls**:

| field | meaning | why the agent needs it |
|---|---|---|
| `ref` | what the **reference** shows on this axis | the **target**: what to steer the prompt toward |
| `gen` | what the **generated** image shows | the **current** state: what to change |
| `verdict` | `match` / `partial` / `mismatch` | which axes to fix first (mismatch → partial → match) |

That's the whole signal: *target, current, distance*. The agent corrects by rewriting the prompt so `gen → ref` on the axes that differ.

**Model capability is the critical path.** Garbage descriptions in → garbage calibration out. The **4B is too weak for fine-grained NSFW recognition**: it mislabels the verdict (central-tendency bias toward `partial`) AND mis-identifies content, so it will confidently call a position "doggy" or "cowgirl" when it is neither. It's only reliable for *coarse* attributes (subject count, nude/clothed, photoreal vs anime, broad scene). For anything fine-grained (named positions, limb arrangement, gaze, hair detail) **use the 30B** (`model_path=30b-a3b`, `precision=nf4`). The node corrects the trivially-wrong verdicts (identical `ref`==`gen` → `match`), but it cannot fix a wrong *description*; only a more capable model can.

**Grounded geometry, not named labels.** Naming a position (`doggy`/`cowgirl`) is unreliable *even at 30B*, so the named-label axis was removed. The pose cluster is now purely observable geometry (`body_orientation` incl.
who faces where, `limb_arrangement`, `contact_points`, `pose`); compose a named position yourself from those primitives if you need one. Geometry survives the model far better than the abstraction.

The axes must **span what the prompt can express**: you can only fix what the prompt can say, and each diff must map to a lever. The default set (configurable on the node) is grouped below.

## Axes (default set; edit `axes` on the node to taste)

- **Identity / cast:** `subject_count`, `gender_mix`, `age_appearance`, `ethnicity_skin`
- **Body:** `body_type`, `breast_size`, `distinctive_features` (tattoos/piercings/marks), `hair`
- **Wardrobe:** `clothing_state` (degree of undress + garments)
- **Action / pose (granular, observable geometry; no named labels):** `sexual_act`, `body_orientation` (who on top/bottom/side + which way each faces), `limb_arrangement` (legs spread/raised, hands), `penetration` (type/depth/angle), `contact_points`, `genital_visibility`, `pose` (torso/head lean, arch)
- **Affect:** `facial_expression`, `gaze`
- **Camera:** `framing` (shot/crop), `camera_angle` (POV/angle)
- **Render:** `scene`, `lighting_color`, `art_style`

Each axis carries a one-line definition in the prompt (so e.g. `gender_mix` is a *count*, not a position). Coarse axes blur the differences that matter for adult imagery; the act / pose cluster is split into many axes so the agent gets specific, actionable targets.

## Analysis profiles (pick the axis set for the act)

A discrete verdict collapses *magnitude*, and a generic axis can hide the very thing you're calibrating: for a blowjob, `sexual_act` reads "oral" in both ref and gen → MATCH, even if in the gen the head is 20 cm from the penis.
So the judge has **analysis profiles**: act-specialized axis sets whose act-critical axes are **distance/proximity-aware**:

| profile | adds (beyond the shared identity/body/render base) |
|---|---|
| `general` | sexual_act, body_orientation, limb_arrangement, penetration, contact_points, genital_visibility, pose |
| `oral` | **mouth_genital_contact**, **mouth_genital_distance** (touching / <5cm / 10–20cm / >20cm), **oral_depth** (tip/half/throat), tongue, hand_on_shaft, gaze_up |
| `penetration` | insertion_depth (tip/shallow/half/hilt), insertion_angle, body_orientation, limb_arrangement, … |
| `handjob` | hand_on_shaft, grip_style, stroke_position (base/mid/tip), mouth_genital_contact, … |
| `solo` | self_touch_location, toy_use, insertion_depth, … |

Now "mouth on the tip" vs "head 20 cm away" is a concrete, scored difference (`mouth_genital_distance: mismatch ref:[contact] gen:[far >20cm]`); the magnitude lives in the `ref`/`gen` text. Set `profile` on the node (or `agent_bridge.py --profile oral`), or override entirely with a custom `axes` list. Profiles are easy to extend in `PROFILES`/`AXIS_DEFS` in `nodes/qwen_judge.py`.

## Step 0: first pass (describe / bootstrap)

The very first iteration has no generated image yet, so the judge runs in **describe mode**: it looks at the reference alone and emits **one canonical scene description**, a coherent, internally-consistent paragraph plus a per-axis target spec. That seeds everything *and* becomes the fixed reference for the whole loop:

```bash
python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
    --run-tag seed --analysis-dir
```

→ `calib_seed.json` = `{"mode":"describe", "description":"…", "axes":{axis:value,…}, "canonical_reference":"…"}`

The agent takes `description` as the **initial prompt** and `axes` as the **initial axis_state**.
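
As a minimal sketch of that hand-off, the snippet below parses a describe-mode payload into the initial prompt, the initial axis state, and the canonical reference. The function name `parse_seed` and the sample values are illustrative assumptions; only the four keys (`mode`, `description`, `axes`, `canonical_reference`) come from the shape shown above.

```python
import json
from typing import Dict, Tuple


def parse_seed(text: str) -> Tuple[str, Dict[str, str], str]:
    """Split a describe-mode payload into (initial prompt, axis state, canonical ref).

    `parse_seed` is a hypothetical helper, not part of agent_bridge.py.
    """
    seed = json.loads(text)
    if seed.get("mode") != "describe":
        raise ValueError("expected a describe-mode seed file")
    # Copy the axes so later per-axis edits do not mutate the parsed seed.
    return seed["description"], dict(seed["axes"]), seed["canonical_reference"]


if __name__ == "__main__":
    # Made-up sample payload, for illustration only.
    sample = json.dumps({
        "mode": "describe",
        "description": "a dimly lit bedroom, warm low-key lighting",
        "axes": {"scene": "dim bedroom", "lighting_color": "warm low-key"},
        "canonical_reference": "a dimly lit bedroom, warm low-key lighting",
    })
    prompt, axis_state, canonical = parse_seed(sample)
    print(prompt)
    print(axis_state)
```

In practice the text would be read from `calib_seed.json` in whatever analysis directory the describe run wrote to.
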
Crucially, the compare loop then **anchors on this canonical reference** (via `--ref-desc-file`) instead of re-reading the reference image every iteration, so the `ref` side never drifts or contradicts itself across passes; only the generated image is re-described each turn.

## Per-iteration algorithm (greedy per-axis hill-climb)

```
best = -1 ; best_state = initial_state ; best_report = none ; stale = 0 ; i = 0
loop:
  i += 1
  prompt = render(state)                          # state = current value per axis
  report = run agent_bridge.py --prompt prompt --negative state.negative
             --seed state.seed --run-tag iter{i}
             --ref-desc-file /calib_seed.json     # anchor on canonical ref
             --workflow wf.json --analysis-dir
  if report.mismatch_count == 0 and report.overall_score >= TARGET:
      stop("converged", state)                    # TARGET e.g. 0.9 (mostly match)
  if report.overall_score > best:
      best = report.overall_score ; best_state = state ; best_report = report ; stale = 0
  else:
      stale += 1 ; state = best_state             # revert the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS:
      stop("plateau/budget", best_state)
  # choose from the report that describes best_state, not from a reverted attempt
  worst = a `mismatch` axis (else a `partial` axis) from best_report.axes
  target_value = best_report.axes[worst].ref      # what the reference shows
  state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis
```

`edit_toward(ref)` is the agent's own reasoning: translate the reference value into prompt wording for that axis (e.g. `gen:[female on bottom] → ref:[female on top, facing partner]` ⇒ set the orientation phrase to "woman straddling on top, facing him"). There is no machine-supplied fix list; the agent owns this step.

### Rules that matter

1. **Change one axis per iteration** for clean attribution of the delta. Batch two only when both are `mismatch` and clearly independent.
2. **Freeze `seed` while searching**: the score must reflect the prompt, not sampler noise. Vary the seed only after converging, to confirm robustness.
3. **Always edit from `best_state`**, never from a worse last state.
4. **Prioritize `mismatch` axes, then `partial`.** Steer toward `ref`; if the obvious wording doesn't flip the verdict, try an alternative phrasing before moving on.
5. **Trust the verdict + the ref/gen text, not fine score deltas.** The overall score is a coarse mean; use `mismatch_count` falling as the real progress signal.
6. **Log every step**: `(iter, axis_changed, old→new, overall_score, mismatch_count)`.

## Worked example

```
iter1  overall=0.55 mism=6  worst: scene MISMATCH ref:[dim bedroom] gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
iter2  overall=0.63 mism=5  worst: body_orientation MISMATCH ref:[female on top, facing partner] gen:[female on bottom]
       edit → "woman straddling on top, facing him"
iter3  overall=0.71 mism=3  worst: lighting_color MISMATCH ref:[warm low-key] gen:[flat daylight]
       edit lighting → "warm low-key lighting"   (mism=4 → revert)
iter4  retry lighting → "warm golden low-key glow"   (mism=2 → keep, overall=0.82)
iter5  overall=0.88 mism=1  worst: hair PARTIAL ref:[curly shoulder-length] gen:[straight long]
       edit hair → "curly shoulder-length brown hair"
iter6  overall=0.93 mism=0  ≥ target → STOP
```

## Report shape the agent reads (`latest.json` / stdout)

```json
{
  "run_tag": "iter002",
  "overall_score": 0.63,
  "mismatch_count": 5,
  "axes": {
    "body_orientation": {"verdict": "mismatch", "ref": "female on top, facing partner", "gen": "female on bottom"},
    "scene": {"verdict": "match", "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "_prompt_id": "...",
  "_report_path": "..."
}
```

## Agent system prompt (paste into your CLI agent)

> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference, measured by a Qwen3-VL judge that compares ~23 axes (identity,
> body, wardrobe, action/pose, affect, camera, render) and for each returns a `verdict`
> (match / partial / mismatch), `ref` (what the reference shows) and `gen` (what the
> generated image shows).
> `overall_score` and `mismatch_count` are computed from the verdicts.
>
> You hold an **axis state** (current value per axis). Each turn: (1) render it to a
> prompt string; (2) run `python agent_bridge.py --workflow --prompt ""
> --negative "" --seed --run-tag iter --analysis-dir `;
> (3) read the printed JSON.
>
> Then greedy per-axis hill-climb: keep the change only if `overall_score` improved, else
> revert to the best state; pick a **mismatch** axis (else a **partial** axis) and rewrite
> that axis's prompt wording to match its `ref` value (you decide the wording; there are
> no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching.
> Stop when `mismatch_count == 0` and `overall_score ≥ TARGET` (default 0.9), or after
> PATIENCE=4 non-improving turns, or MAX_ITERS=25. Log every step; report best prompt + score.
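
The greedy per-axis hill-climb described in this playbook can be sketched in Python as below. This is a sketch under stated assumptions, not the project's implementation: `judge` stands in for running `agent_bridge.py` and reading the report, `edit_toward` stands in for the agent's own wording decision, and `hill_climb`, `pick_worst_axis` and `fake_judge` are hypothetical names. Only the report fields (`overall_score`, `mismatch_count`, `axes[...].verdict/ref/gen`) and the TARGET / PATIENCE / MAX_ITERS defaults come from the text above.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]
Report = Dict[str, object]


def pick_worst_axis(report: Report) -> Optional[str]:
    """Return the first `mismatch` axis, else the first `partial` axis, else None."""
    axes = report["axes"]
    for wanted in ("mismatch", "partial"):
        for name, entry in axes.items():
            if entry["verdict"] == wanted:
                return name
    return None


def hill_climb(
    initial_state: State,
    judge: Callable[[State], Report],        # assumption: wraps the bridge run
    edit_toward: Callable[[str, str], str],  # assumption: the agent's rewording
    target: float = 0.9,
    patience: int = 4,
    max_iters: int = 25,
) -> Tuple[str, State, List[Tuple[int, float, int]]]:
    state = dict(initial_state)
    best_score, best_state, best_report, stale = -1.0, dict(state), None, 0
    log: List[Tuple[int, float, int]] = []
    for i in range(1, max_iters + 1):
        report = judge(state)
        score = report["overall_score"]
        log.append((i, score, report["mismatch_count"]))
        if report["mismatch_count"] == 0 and score >= target:
            return "converged", state, log
        if score > best_score:
            best_score, best_state, best_report, stale = score, dict(state), report, 0
        else:
            stale += 1  # the last edit did not help; the next edit restarts from best_state
        if stale >= patience:
            return "plateau", best_state, log
        worst = pick_worst_axis(best_report)
        if worst is None:
            return "plateau", best_state, log
        state = dict(best_state)  # always edit from the best state
        state[worst] = edit_toward(worst, best_report["axes"][worst]["ref"])  # ONE axis
    return "budget", best_state, log


def fake_judge(reference: State) -> Callable[[State], Report]:
    """Toy stand-in for the judge: exact string match per axis, score = match fraction."""
    def judge(state: State) -> Report:
        axes = {
            name: {"verdict": "match" if state.get(name) == ref else "mismatch",
                   "ref": ref, "gen": state.get(name, "")}
            for name, ref in reference.items()
        }
        mismatches = sum(1 for e in axes.values() if e["verdict"] == "mismatch")
        return {"overall_score": 1 - mismatches / len(axes),
                "mismatch_count": mismatches, "axes": axes}
    return judge


if __name__ == "__main__":
    reference = {"scene": "dim bedroom", "lighting_color": "warm low-key",
                 "framing": "medium shot"}
    start = {"scene": "bright kitchen", "lighting_color": "flat daylight",
             "framing": "medium shot"}
    status, final, log = hill_climb(start, fake_judge(reference), lambda axis, ref: ref)
    print(status, final, log)
```

The toy judge flips an axis to `match` as soon as the wording equals the reference, which a real vision judge will not do; the point is only to show the control flow (fixed budget, revert-to-best, one axis per turn). A real `edit_toward` would also try alternative phrasings when the obvious wording does not flip the verdict.
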