Ethanfel 959ec70065 Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger
external agent. Each axis reports the target value (ref), the current value (gen)
and the closeness (score) — the target/current/distance an agent needs to
calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/
camera/render) so the action cluster stays discriminative for explicit content.
swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first;
default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-26 22:52:40 +02:00

Calibration policy — the agent's playbook

The local Qwen3-VL judge only observes and scores; it does not propose fixes. The external agent (you, or a stronger model) decides every correction. The judge's job is therefore to hand the agent the information it needs to calibrate, and the agent's job is to turn that information into prompt edits.

What the agent needs from each comparison (the information model)

To move a generated image toward a reference, for every dimension the prompt controls, the agent needs three things:

field	meaning	why the agent needs it
`ref`	what the reference shows on this axis	the target — what to steer the prompt toward
`gen`	what the generated image shows	the current state — what to change
`score`	0–1 closeness	the gap / priority — which axes to fix first

That's the whole signal: target, current, distance. The agent corrects by rewriting the prompt so gen → ref on the lowest-scoring axes. The judge returns exactly this per axis ({"score", "ref", "gen"}) plus a top-level overall_score.

The axes must span what the prompt can express: the agent can only fix what the prompt can say, so each reported difference must map to something the prompt can change. The default set (configurable on the node) is grouped below.

Axes (default set — edit `axes` on the node to taste)

Identity / cast: subject_count, gender_mix, age_appearance, ethnicity_skin
Body: body_type, distinctive_features (tattoos/piercings/marks), hair
Wardrobe: clothing_state (degree of undress + garments)
Action (where explicit content concentrates): sexual_act, position, penetration, explicitness, body_contact
Affect: pose, facial_expression, gaze
Camera: framing (shot/crop), camera_angle (POV/angle)
Render: scene, lighting_color, art_style

Coarse axes blur the differences that matter for adult imagery; this set keeps the act / interaction cluster granular so the agent gets actionable targets.
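
As a rough illustration, the default grouping above can be written down as a plain mapping and flattened into the ordered axis list. The axis names are copied from the list; the dictionary layout, `flatten_axes` and `group_of` are hypothetical helpers, and the format the node actually expects for `axes` is not stated here.

```python
# Default axis groups, copied from the list above.
# The dict layout itself is an assumption, not the node's real config format.
DEFAULT_AXES = {
    "identity": ["subject_count", "gender_mix", "age_appearance", "ethnicity_skin"],
    "body": ["body_type", "distinctive_features", "hair"],
    "wardrobe": ["clothing_state"],
    "action": ["sexual_act", "position", "penetration", "explicitness", "body_contact"],
    "affect": ["pose", "facial_expression", "gaze"],
    "camera": ["framing", "camera_angle"],
    "render": ["scene", "lighting_color", "art_style"],
}


def flatten_axes(groups):
    """Return axis names in group order, rejecting duplicate names."""
    seen, flat = set(), []
    for names in groups.values():
        for name in names:
            if name in seen:
                raise ValueError(f"duplicate axis: {name}")
            seen.add(name)
            flat.append(name)
    return flat


def group_of(axis, groups=DEFAULT_AXES):
    """Return the group an axis belongs to (hypothetical helper)."""
    for group, names in groups.items():
        if axis in names:
            return group
    raise KeyError(axis)


print(len(flatten_axes(DEFAULT_AXES)), "axes")
```

Keeping the groups explicit makes it easy to check that a customised set still covers each cluster before handing it to the node.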

Per-iteration algorithm (greedy per-axis hill-climb)

best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0
loop:
  i += 1
  prompt = render(state)                       # state = current value per axis
  report = run agent_bridge.py --prompt prompt --negative state.negative
                               --seed state.seed --run-tag iter{i}
                               --workflow wf.json --analysis-dir <report_dir>
  if report.overall_score >= TARGET: stop("converged", state)         # e.g. 0.85
  if report.overall_score > best_score:
      best_score = report.overall_score ; best_state = state ; stale = 0
  else:
      stale += 1 ; state = best_state          # revert the change that didn't help
  if stale >= PATIENCE or i >= MAX_ITERS: stop("plateau/budget", best_state)

  worst = axis with the lowest report.axes[*].score
  target_value = report.axes[worst].ref         # what the reference shows
  state = apply(best_state, worst, edit_toward(target_value))   # change ONE axis

edit_toward(ref) is the agent's own reasoning: translate the reference value into prompt wording for that axis (e.g. gen:[missionary] → ref:[doggy style] ⇒ set the position phrase to "doggy style"). No machine-supplied fix list — the agent owns this step.
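
The loop above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `evaluate` is a hypothetical callable standing in for "render the state, run the bridge, return the report", and `edit_toward` is a hypothetical callable standing in for the agent's own wording decision. One deliberate deviation from the pseudocode: after a revert, the worst axis is read from the report of the best state rather than from the rejected one, so the retry targets what is actually wrong with the state being edited.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]   # axis name -> current prompt wording
Report = dict            # {"overall_score": float, "axes": {name: {"score", "ref", "gen"}}}


def hill_climb(
    initial_state: State,
    evaluate: Callable[[State, int], Report],      # hypothetical: (state, iteration) -> report
    edit_toward: Callable[[str, str, int], str],   # hypothetical: (axis, ref, attempt) -> wording
    target: float = 0.85,
    patience: int = 4,
    max_iters: int = 25,
) -> Tuple[str, State, float, List[Tuple[int, float]]]:
    best_score, best_state, best_report = -1.0, dict(initial_state), None
    state, stale, i = dict(initial_state), 0, 0
    attempts: Dict[str, int] = {}   # how many wordings were tried per axis
    log: List[Tuple[int, float]] = []
    while True:
        i += 1
        report = evaluate(state, i)
        score = report["overall_score"]
        log.append((i, score))
        if score >= target:
            return "converged", state, score, log
        if score > best_score:
            best_score, best_state, best_report, stale = score, dict(state), report, 0
        else:
            stale += 1              # the last edit did not help; it is dropped below
        if stale >= patience or i >= max_iters:
            return "plateau/budget", best_state, best_score, log
        # Always edit from the best state, changing exactly one axis.
        axes = best_report["axes"]
        worst = min(axes, key=lambda name: axes[name]["score"])
        attempts[worst] = attempts.get(worst, 0) + 1
        state = dict(best_state)
        state[worst] = edit_toward(worst, axes[worst]["ref"], attempts[worst])
```

Passing the attempt count to `edit_toward` gives the agent a hook for trying an alternative phrasing when the first wording for an axis was reverted.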

Rules that matter

Change one axis per iteration — clean attribution of the score delta. Batch two only when both are very low and clearly independent.
Freeze seed while searching — the score must reflect the prompt, not sampler noise. Vary the seed only after converging, to confirm robustness.
Always edit from best_state, never from a worse last state.
Steer toward ref on the worst axis; if the obvious wording doesn't move the score, try an alternative phrasing for that axis before moving on to another axis.
Near the margin, don't over-trust one reading. swap_eval already averages two orderings; if two candidates are within ~0.03, re-run each on a second seed.
Log every step: (iter, axis_changed, old→new, overall_score, worst-axes).
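
The near-margin rule above could be sketched like this. The 0.03 margin comes from the rule; `rescore` is a hypothetical callable that re-runs one candidate on another seed, and averaging the two seeds to break the tie is an assumption, since the rule only says to re-run.

```python
from statistics import mean
from typing import Callable, Sequence

MARGIN = 0.03  # the "~0.03" from the rule above


def is_near_tie(score_a: float, score_b: float, margin: float = MARGIN) -> bool:
    """True when two overall scores are too close to trust a single reading."""
    return abs(score_a - score_b) <= margin


def pick_candidate(
    scores: Sequence[float],               # (score_a, score_b) on the search seed
    rescore: Callable[[int, int], float],  # hypothetical: (candidate_index, seed) -> score
    second_seed: int,
    margin: float = MARGIN,
) -> int:
    """Return the index (0 or 1) of the candidate to keep."""
    a, b = scores
    if not is_near_tie(a, b, margin):
        return 0 if a > b else 1
    # Too close to call: re-run each candidate on a second seed and
    # compare the two-seed averages (assumed tie-break policy).
    averaged = [mean([s, rescore(idx, second_seed)]) for idx, s in enumerate(scores)]
    return 0 if averaged[0] >= averaged[1] else 1
```

The second seed is used only for this comparison; the search itself keeps the frozen seed.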

Worked example

iter1  overall=0.41   worst: scene 0.30  ref:[dim bedroom]   gen:[bright kitchen]
       edit scene → "dimly lit bedroom"
iter2  overall=0.58   worst: position 0.35  ref:[doggy style] gen:[missionary]
       edit position → "doggy style"
iter3  overall=0.71   worst: lighting_color 0.50 ref:[warm low-key] gen:[flat daylight]
       edit lighting → "warm low-key lighting"   (0.69 → revert)
iter4  overall=0.69   retry lighting → "warm golden low-key glow"   (0.84 → keep)
iter5  overall=0.84   worst: clothing_state 0.80 ref:[red lace lingerie] gen:[plain bra]
       edit clothing → "red lace lingerie"
iter6  overall=0.89   ≥ target → STOP

Report shape the agent reads (`latest.json` / stdout)

{
  "run_tag": "iter002",
  "overall_score": 0.58,
  "axes": {
    "position": {"score": 0.35, "ref": "doggy style", "gen": "missionary"},
    "scene":    {"score": 0.92, "ref": "dim bedroom", "gen": "dim bedroom"}
  },
  "prompt_used": "...", "_prompt_id": "...", "_report_path": "..."
}
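
A small sketch of how an agent-side script might read this shape and list the axes worst-first. The field names follow the report above; `worst_first` is a hypothetical helper, and the sample values are illustrative ones borrowed from the worked example.

```python
import json


def worst_first(report: dict) -> list:
    """Return (axis, score, ref, gen) tuples sorted by ascending score."""
    rows = []
    for name, entry in report["axes"].items():
        missing = {"score", "ref", "gen"} - set(entry)
        if missing:
            raise ValueError(f"axis {name!r} is missing {sorted(missing)}")
        rows.append((name, float(entry["score"]), entry["ref"], entry["gen"]))
    return sorted(rows, key=lambda row: row[1])


# Illustrative report; values are borrowed from the worked example.
sample = json.loads("""
{
  "run_tag": "iter003",
  "overall_score": 0.71,
  "axes": {
    "scene":          {"score": 0.92, "ref": "dim bedroom",  "gen": "dim bedroom"},
    "lighting_color": {"score": 0.50, "ref": "warm low-key", "gen": "flat daylight"}
  }
}
""")

for name, score, ref, gen in worst_first(sample):
    print(f"{name:<16} {score:.2f}  gen:[{gen}] -> ref:[{ref}]")
```

The first row is the axis to edit next; the remaining rows are useful for the per-step log.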

Agent system prompt (paste into your CLI agent)

You are the controller for a local image prompt calibrator. Goal: make a generated image match a reference, measured by a Qwen3-VL judge that scores ~20 axes (identity, body, wardrobe, action, affect, camera, render) and for each returns score (0–1 closeness), ref (what the reference shows) and gen (what the generated image shows).

You hold an axis state (current value per axis). Each turn: (1) render it to a prompt string; (2) run python agent_bridge.py --workflow <wf> --prompt "<rendered>" --negative "<neg>" --seed <seed> --run-tag iter<N> --analysis-dir <report_dir>; (3) read the printed JSON.

Then run a greedy per-axis hill-climb: keep the change only if overall_score improved, otherwise revert to the best state; pick the lowest-scoring axis and rewrite that axis's prompt wording to match its ref value (you decide the wording; there are no machine-supplied fixes). Change ONE axis per turn. Keep the seed fixed while searching. Stop when overall_score ≥ TARGET (default 0.85), after PATIENCE=4 non-improving turns, or at MAX_ITERS=25. Log every step and report the best prompt and its score.
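
Outside the prompt text itself, the bridge call from step (2) could be wrapped in a script along these lines. The flags are the ones named in this document; the function names are hypothetical, and the sketch assumes the bridge prints only the JSON report on stdout. The run tag follows the `iter<N>` form used in the prompt; whether the bridge zero-pads it, as the sample report's `iter002` suggests, is not stated here.

```python
import json
import subprocess
import sys


def build_bridge_command(workflow, prompt, negative, seed, iteration,
                         analysis_dir, script="agent_bridge.py"):
    """Assemble the argument list for one bridge run (no shell quoting needed)."""
    return [
        sys.executable, script,
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", f"iter{iteration}",
        "--analysis-dir", analysis_dir,
    ]


def run_bridge(cmd):
    """Run the bridge and parse its stdout as the JSON report.

    Assumes stdout contains nothing but the report; raises
    subprocess.CalledProcessError on a non-zero exit code.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids quoting problems when the rendered prompt contains commas, quotes or parentheses.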
