Local Prompt Calibrator — Methodology
Goal: a fully local ComfyUI feedback loop where a vision‑language model (VLM) scores how close a generated image is to a reference image, and that score + a structured difference analysis is used to calibrate the prompt‑generation method (ComfyUI‑Prompt‑Builder, the "SxCP" nodes) until the generated image matches the reference.
1. The loop at a glance
     ┌──────────────────────────────────────────────┐
     │  REFERENCE image (the target look)           │
     └───────────────┬──────────────────────────────┘
                     │
┌────────────────────▼────────────────┐      calibration deltas
│  Prompt-Builder (SxCP) ── "method"  │◄──── (axis nudges / knob
│  seeded pools + profile knobs       │       overrides / seed move)
└────────────────────┬────────────────┘
                     │ prompt + negative
┌────────────────────▼────────────────┐
│  T2I model (SDXL / Flux / Krea2)    │  ← fix the sampler seed while
└────────────────────┬────────────────┘    searching the prompt axes
                     │ generated image
┌────────────────────▼──────────────────────────────────┐
│  Qwen3-VL JUDGE node ── the "vllm node"               │
│  in : reference + generated                           │
│  out: overall_score 0..1                              │
│       per-axis {verdict, ref, gen} over ~23 axes      │
│       (identity, body, wardrobe, action, affect,      │
│        camera, render) ── target vs current values    │
│       (local model observes only; no fixes suggested) │
└────────────────────┬──────────────────────────────────┘
                     │ verdict + ref/gen per axis
┌────────────────────▼────────────────┐
│  CALIBRATOR / controller            │
│  - accumulate per-axis scores       │
│  - map diffs → axis adjustments     │
│  - update Prompt-Builder knobs      │
│  - stop when overall_score ≥ target │
│    or max iterations reached        │
└─────────────────────────────────────┘
The novel piece is the Judge node. Off‑the‑shelf Qwen‑VL nodes emit free text;
a calibrator needs a machine‑readable score + per‑axis diffs so the controller
can act on them. That is what nodes/qwen_judge.py in this repo provides.
2. The VLLM node — what to reuse
You already have the model converted locally:
/media/p5/qwen3vl_4b_abliterated_comfy_convert/
├── hf_bf16/ ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
└── hf_fp8/ ← same model, FP8 (≈4–5 GB, trivially fits the RTX 5090 32 GB)
The abliterated variant matters: stock Qwen3‑VL will often refuse to "describe or analyze" adult imagery, which would break the loop. huihui‑ai removed the text‑side refusal direction, so it scores NSFW reference/generated pairs without bailing.
Reusable ComfyUI nodes (pick one as the plumbing base)
| Repo | Backend | Multi‑image | Local path | Notes |
|---|---|---|---|---|
| hardik-uppal/ComfyUI-QwenVL-MultiImage | transformers | ✅ images + images_batch_2/3 | needs tiny tweak | Best base: built for "compare these images, describe the differences"; supports FP16 / 8‑bit / 4‑bit and pre‑quantized FP8 (matches your hf_fp8). |
| IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct | transformers | ✅ multi‑image query | HF download | Clean native Qwen3‑VL‑Instruct integration. |
| jren712/ComfyUI-QwenVL-abliterated | transformers | ✅ | abliterated‑oriented | Fork tuned for the abliterated weights. |
| 1038lab/ComfyUI-QwenVL | GGUF (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. |
Recommendation: don't run any of them as‑is for the loop — they only output text.
Instead reuse their model‑load + apply_chat_template + generate plumbing inside
a purpose‑built Judge node (this repo) that forces structured JSON output. The
ComfyUI-QwenVL-MultiImage loader is the closest template (it already handles two
image batches + FP8).
Model sizing on 32 GB (RTX 5090) — abliterated, latest Qwen VL
As of June 2026 the latest Qwen VL family is Qwen3‑VL (Qwen3.5‑VL shipped early 2026, but abliterated builds of it are text‑only so far — no uncensored Qwen3.5‑VL yet). So "latest + uncensored + fits 32 GB" = Qwen3‑VL‑30B‑A3B abliterated. All rows below are huihui‑ai abliterated (uncensored) weights:
| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict |
|---|---|---|---|
| Qwen3‑VL‑30B‑A3B‑Instruct (HF) | nf4 (4‑bit) or GGUF Q4_K_M | ~18 GB | Best judge that fits. MoE → only 3B active, so it's fast despite 30B total. transformers class Qwen3VLMoeForConditionalGeneration (auto‑detected by the node). |
| Qwen3‑VL‑8B‑Instruct (HF) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; drop‑in for the judge node. |
| Qwen3‑VL‑4B‑Instruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. |
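The VRAM column can be sanity-checked with a weight-only estimate. The sketch below is an approximation under stated assumptions: nominal parameter counts (30B, 8B, 4B) rather than exact ones, and a flat bytes-per-parameter figure per precision (real nf4 checkpoints carry extra quantization constants). It ignores activations, the vision tower's working memory, and the KV cache, which is why the table's figures sit a little above these numbers.

```python
# Rough weight-only VRAM estimate. Parameter counts are nominal and the
# bytes-per-parameter values are simplifying assumptions, not measurements.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Activations and KV cache are not included.
    """
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    rows = [
        ("Qwen3-VL-30B-A3B", 30, "nf4"),
        ("Qwen3-VL-8B", 8, "bf16"),
        ("Qwen3-VL-4B", 4, "fp8"),
        ("Qwen3-VL-4B", 4, "bf16"),
    ]
    for name, params, precision in rows:
        print(f"{name:18s} {precision:5s} ~{weight_gb(params, precision):.0f} GB weights")
```

Note that the MoE model still needs all 30B parameters resident; only the compute per token scales with the 3B active parameters.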
Gemma alternative: Gemma‑3‑27B‑it (abliterated, 4‑bit ~16 GB) is a solid different visual prior if you want a second opinion, but the Krea2 text encoder + Prompt‑Builder are already Qwen‑aligned, so staying on Qwen3‑VL keeps the vocabulary consistent.
Download an upgrade and point the node's model_path at it:
hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
--local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4
Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always co‑reside — run them as separate queue steps and let ComfyUI unload between; the loop is sequential anyway. The 8B bf16 judge co‑resides more easily.
3. Scoring rubric (what the VLM actually returns)
The judge prompts Qwen3‑VL to return strict JSON that gives, for each axis, the
target value (ref), the current value (gen), and a discrete verdict on the gap
between them: exactly the target / current / distance an agent needs to calibrate.
The local model only observes; it suggests no fixes (a stronger external model owns
correction).
{
"axes": {
"subject_count": {"verdict": "match", "ref": "1 woman", "gen": "1 woman"},
"body_orientation":{"verdict": "mismatch", "ref": "female on top, facing partner", "gen": "female on bottom"},
"clothing_state": {"verdict": "mismatch", "ref": "red lace lingerie", "gen": "nude"},
"scene": {"verdict": "partial", "ref": "dim bedroom", "gen": "lit bedroom"},
"lighting_color": {"verdict": "match", "ref": "warm low-key", "gen": "warm low-key"}
}
}
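Small VLMs sometimes wrap the JSON in extra chatter, so the raw text usually needs a tolerant parse plus a strict shape check. The following is a minimal sketch of that step, not the code in nodes/qwen_judge.py; the function name `parse_judge_output` and the choice to slice from the first `{` to the last `}` are assumptions made for illustration.

```python
import json

ALLOWED_VERDICTS = {"match", "partial", "mismatch"}


def parse_judge_output(raw: str) -> dict:
    """Extract and validate the per-axis verdicts from raw judge text.

    Takes the span from the first '{' to the last '}' (tolerates text around
    the JSON), then checks that every axis has a known verdict and string
    ref/gen values. Raises ValueError on any violation.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

    axes = data.get("axes") if isinstance(data, dict) else None
    if not isinstance(axes, dict) or not axes:
        raise ValueError("missing or empty 'axes' object")

    for name, entry in axes.items():
        if not isinstance(entry, dict):
            raise ValueError(f"axis {name!r} is not an object")
        if entry.get("verdict") not in ALLOWED_VERDICTS:
            raise ValueError(f"axis {name!r} has an unknown verdict")
        for key in ("ref", "gen"):
            if not isinstance(entry.get(key), str):
                raise ValueError(f"axis {name!r} is missing {key!r}")
    return axes
```

Failing loudly on a malformed response lets the controller retry the judge call instead of silently scoring a partial result.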
A discrete verdict (match/partial/mismatch) is used instead of a 0–1 score: small VLMs
give unreliable fine scores (identical ref/gen often scored ~0.6) but classify the three
buckets reliably. overall_score + mismatch_count are computed from the verdicts on our
side (mean ordinal), so they're trustworthy as a stop signal. The axis list is
configurable; the default ~23 axes are grouped identity / body / wardrobe / action·pose
/ affect / camera / render, with the action·pose cluster split fine (sexual_act,
body_orientation, limb_arrangement, penetration, contact_points,
genital_visibility) so it stays discriminative for explicit content. Each axis carries a
one-line definition in the prompt. The agent steers each mismatch/partial axis toward
its ref. See CALIBRATION_POLICY.md.
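The "mean ordinal" aggregation can be sketched in a few lines. The numeric mapping below (match = 1.0, partial = 0.5, mismatch = 0.0) and the decision to count only `mismatch` verdicts in `mismatch_count` are assumptions; the text above states only that both values are derived from the verdicts.

```python
# Assumed ordinal mapping; the source only says "mean ordinal".
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def aggregate(axes: dict) -> tuple:
    """Return (overall_score, mismatch_count) computed from per-axis verdicts."""
    if not axes:
        raise ValueError("no axes to aggregate")
    values = [VERDICT_VALUE[entry["verdict"]] for entry in axes.values()]
    overall_score = sum(values) / len(values)
    mismatch_count = sum(1 for entry in axes.values() if entry["verdict"] == "mismatch")
    return overall_score, mismatch_count


# Five axes with the same verdict pattern as the JSON example above.
example = {
    "subject_count": {"verdict": "match"},
    "body_orientation": {"verdict": "mismatch"},
    "clothing_state": {"verdict": "mismatch"},
    "scene": {"verdict": "partial"},
    "lighting_color": {"verdict": "match"},
}

if __name__ == "__main__":
    print(aggregate(example))
```

With two matches, one partial, and two mismatches the mean is (1 + 0 + 0 + 0.5 + 1) / 5 = 0.5, well below a stop threshold such as 0.85.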
Reducing VLM‑as‑judge variance (important)
VLM scoring is noisy and biased. Mitigations baked into the node / recommended:
- Position‑bias swap: run the judge twice with reference/generated order swapped and average the per‑axis scores (swap_eval=True). Cuts the "first image wins" bias.
- Low temperature (0.0–0.3) + a fixed rubric in the system prompt → repeatable scores.
- Anchored 0–1 rubric (0 = unrelated, 0.5 = same category/different details, 1 = near‑identical) so scores are comparable across iterations.
- Evidence‑first: ask the model to state the concrete difference before the number; reasoning‑then‑score is measurably more reliable than score‑then‑reasoning.
- Average over k T2I seeds for the same prompt if you want the score to reflect the prompt rather than sampler noise — or, cheaper, freeze the T2I seed during the axis search and only vary it once at the end.
4. The calibrator / controller
Chosen design: the controller is an external CLI agent, not an in‑graph node. The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into the CalibratorPromptReceptor node, and queues ComfyUI via its HTTP API, one prompt_id per iteration. See AGENT_LOOP.md and agent_bridge.py. The options below describe the policy the agent can run.
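One iteration of the inject-and-queue step might look like the sketch below. It is not the contents of agent_bridge.py. It relies on ComfyUI's `POST /prompt` endpoint, which accepts an API-format workflow and returns a `prompt_id`; the server address, the receptor's node id (`"7"`), and its input name (`"text"`) are assumptions to replace with the values from your exported workflow.

```python
import copy
import json
import urllib.request
import uuid

COMFY_URL = "http://127.0.0.1:8188"  # assumed address; adjust to your instance


def inject_prompt(workflow: dict, node_id: str, text: str, input_name: str = "text") -> dict:
    """Return a copy of an API-format workflow with the receptor's input replaced.

    node_id and input_name are hypothetical; read them from your exported workflow.
    """
    patched = copy.deepcopy(workflow)
    patched[node_id]["inputs"][input_name] = text
    return patched


def queue_prompt(workflow: dict, client_id: str = "") -> str:
    """POST the workflow to ComfyUI and return the prompt_id of the queued run."""
    payload = json.dumps({
        "prompt": workflow,
        "client_id": client_id or uuid.uuid4().hex,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["prompt_id"]


# Minimal stand-in for an exported API-format workflow (one node only).
example_workflow = {
    "7": {"class_type": "CalibratorPromptReceptor", "inputs": {"text": ""}},
}
```

A typical call is `queue_prompt(inject_prompt(workflow, "7", new_prompt))`; the returned `prompt_id` is what ties the iteration's log entry to its ComfyUI run. Copying the workflow before patching keeps the base graph unchanged across iterations.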
Prompt‑Builder is a deterministic, seeded, combinatorial generator (it is not an
LLM). So "calibration" = searching the space of (seed, profile, per‑axis overrides)
to maximize overall_score. Three controller options, easiest → strongest:
- Greedy per‑axis hill‑climb (start here). Take the lowest‑scoring axis, rewrite that axis's prompt wording toward its ref (target) value, regenerate, keep the change if overall_score improved, else revert. Loop until overall_score ≥ target or no axis improves. The agent decides the wording (no machine fixes). Implementable with the Prompt‑Builder For‑Loop Start/End + Accumulator nodes.
- Black‑box optimizer over the knob vector. Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES / a simple bandit, objective = overall_score. Better for >3–4 interacting axes; needs a thin Python controller node that holds state across iterations.
- LLM‑in‑the‑loop rewriter. Feed diff_analysis to a (local) text LLM that proposes the next knob settings (or, if you move to free‑text prompts, rewrites the prompt). Most flexible, least reproducible; use the same abliterated Qwen3 text head to keep it local and uncensored.
Loop hygiene: fix resolution/sampler/steps across iterations; freeze T2I seed while
searching; stop on overall_score ≥ target (e.g. 0.85) or max_iters; log every
(knobs, score, diff) triple so the search is auditable and resumable.
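The greedy per-axis hill-climb, with the stop conditions and logging described above, can be sketched as a plain function. Everything here is illustrative: `evaluate` stands in for the whole generate-then-judge round trip, `rewrite` stands in for the agent's wording decision, and representing the prompt as a dict keyed by axis name is an assumption about how the agent tracks its edits.

```python
from typing import Callable, Dict, List, Tuple

Axes = Dict[str, Dict[str, str]]  # {axis: {"verdict": ..., "ref": ..., "gen": ...}}
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}  # assumed mapping


def overall(axes: Axes) -> float:
    """Mean ordinal value over all axes."""
    return sum(VERDICT_VALUE[e["verdict"]] for e in axes.values()) / len(axes)


def hill_climb(
    parts: Dict[str, str],
    evaluate: Callable[[Dict[str, str]], Axes],  # hypothetical: generate + judge
    rewrite: Callable[[str, str, str], str],     # hypothetical: (axis, current, ref) -> new wording
    target: float = 0.85,
    max_iters: int = 10,
) -> Tuple[Dict[str, str], float, List[dict]]:
    """Greedy search: fix the worst axis, keep the change only if the score improves."""
    axes = evaluate(parts)
    best = overall(axes)
    history: List[dict] = []
    rejected = set()  # axes tried without improvement since the last accepted change

    for iteration in range(max_iters):
        if best >= target:
            break
        candidates = sorted(
            (VERDICT_VALUE[e["verdict"]], name)
            for name, e in axes.items()
            if e["verdict"] != "match" and name not in rejected
        )
        if not candidates:
            break  # no remaining axis improves the score
        axis = candidates[0][1]

        trial = dict(parts)
        trial[axis] = rewrite(axis, parts.get(axis, ""), axes[axis]["ref"])
        trial_axes = evaluate(trial)
        trial_score = overall(trial_axes)
        kept = trial_score > best

        # One auditable record per iteration: what was tried, the score, the outcome.
        history.append({"iter": iteration, "axis": axis, "parts": trial,
                        "score": trial_score, "kept": kept})
        if kept:
            parts, axes, best = trial, trial_axes, trial_score
            rejected.clear()
        else:
            rejected.add(axis)  # revert: parts stays unchanged
    return parts, best, history
```

Writing each `history` entry to a JSON Lines file as it is produced would make the search resumable after a crash; the list is returned here only to keep the sketch free of file I/O.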
5. Concrete build order
- Judge node (this repo, nodes/qwen_judge.py): load local Qwen3‑VL‑4B abliterated, take ref+gen, output overall_score (FLOAT), axis_scores (JSON STRING), diff_analysis (STRING), raw (STRING). ✅ scaffolded.
- Wire the loop in a workflow: Prompt‑Builder → T2I → Judge → Accumulator, using the SxCP For‑Loop nodes; route overall_score into the loop's stop condition.
- Controller node: start with greedy per‑axis hill‑climb that reads diff_analysis and emits knob overrides back into Prompt‑Builder's split control nodes.
- Tune the judge: calibrate the rubric on a handful of known ref/gen pairs; enable swap_eval; pick temperature; decide if you need to step up to 8B/30B‑A3B.
See README.md for install/usage of the Judge node.