Initial commit: VLM-as-judge prompt calibration loop

Qwen3-VL image-similarity judge node, external-prompt receptor node,
agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/
calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-26 22:15:56 +02:00
commit 95198a15b5
13 changed files with 1294 additions and 0 deletions
+198
View File
@@ -0,0 +1,198 @@
# Local Prompt Calibrator — Methodology
> Goal: a **fully local** ComfyUI feedback loop where a visionlanguage model (VLM)
> scores how close a *generated* image is to a *reference* image, and that score +
> a structured difference analysis is used to **calibrate the promptgeneration
> method** ([ComfyUIPromptBuilder](../../ComfyUI-Prompt-Builder), the "SxCP" nodes)
> until the generated image matches the reference.
---
## 1. The loop at a glance
```
┌──────────────────────────────────────────────┐
│ REFERENCE image (the target look) │
└───────────────┬──────────────────────────────┘
┌────────────────────▼────────────────┐ calibration deltas
│ Prompt-Builder (SxCP) ── "method" │◄──── (axis nudges / knob
│ seeded pools + profile knobs │ overrides / seed move)
└────────────────────┬────────────────┘
│ prompt + negative
┌────────────────────▼────────────────┐
│ T2I model (SDXL / Flux / Krea2) │ ← fix the sampler seed while
└────────────────────┬────────────────┘ searching the prompt axes
│ generated image
┌────────────────────▼──────────────────────────────────┐
│ Qwen3-VL JUDGE node ── the "vllm node" │
│ in : reference + generated │
│ out: overall_score 0..1 │
│ per-axis scores (cast, clothing, pose, scene, │
│ composition, expression, color/lighting) │
│ diff_analysis (JSON: what's off + how to fix, │
│ phrased in Prompt-Builder axis vocabulary) │
└────────────────────┬──────────────────────────────────┘
│ score + diffs
┌────────────────────▼────────────────┐
│ CALIBRATOR / controller │
│ - accumulate per-axis scores │
│ - map diffs → axis adjustments │
│ - update Prompt-Builder knobs │
│ - stop when overall_score ≥ target │
│ or max iterations reached │
└──────────────────────────────────────┘
```
The novel piece is the **Judge node**. Offtheshelf QwenVL nodes emit free text;
a calibrator needs a **machinereadable score + peraxis diffs** so the controller
can act on them. That is what `nodes/qwen_judge.py` in this repo provides.
---
## 2. The VLLM node — what to reuse
You already have the model converted locally:
```
/media/p5/qwen3vl_4b_abliterated_comfy_convert/
├── hf_bf16/ ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
└── hf_fp8/ ← same model, FP8 (≈45 GB, trivially fits the RTX 5090 32 GB)
```
The **abliterated** variant matters: stock Qwen3VL will often refuse to "describe or
analyze" adult imagery, which would break the loop. huihuiai removed the textside
refusal direction, so it scores NSFW reference/generated pairs without bailing.
### Reusable ComfyUI nodes (pick one as the plumbing base)
| Repo | Backend | Multiimage | Local path | Notes |
|---|---|---|---|---|
| **[hardik-uppal/ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)** | transformers | ✅ `images` + `images_batch_2/3` | needs tiny tweak | **Best base** — built for "compare these images, describe the differences"; supports FP16 / 8bit / 4bit **and prequantized FP8** (matches your `hf_fp8`). |
| [IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct](https://github.com/IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct) | transformers | ✅ multiimage query | HF download | Clean native Qwen3VLInstruct integration. |
| [jren712/ComfyUI-QwenVL-abliterated](https://github.com/jren712/ComfyUI-QwenVL-abliterated) | transformers | ✅ | abliteratedoriented | Fork tuned for the abliterated weights. |
| [1038lab/ComfyUI-QwenVL](https://github.com/1038lab/ComfyUI-QwenVL) | **GGUF** (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. |
**Recommendation:** don't run any of them *asis* for the loop — they only output text.
Instead reuse their **modelload + `apply_chat_template` + `generate`** plumbing inside
a purposebuilt **Judge node** (this repo) that forces structured JSON output. The
`ComfyUI-QwenVL-MultiImage` loader is the closest template (it already handles two
image batches + FP8).
### Model sizing on 32 GB (RTX 5090) — abliterated, latest Qwen VL
As of June 2026 the **latest Qwen VL family is Qwen3VL** (Qwen3.5VL shipped early
2026, but abliterated builds of it are **textonly so far** — no uncensored
Qwen3.5*VL* yet). So "latest + uncensored + fits 32 GB" = **Qwen3VL30BA3B abliterated**.
All rows below are huihuiai abliterated (uncensored) weights:
| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict |
|---|---|---|---|
| **Qwen3VL30BA3BInstruct** ([HF](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated)) | **nf4 (4bit)** or GGUF Q4_K_M | ~18 GB | **Best judge that fits.** MoE → only 3B active, so it's fast despite 30B total. transformers class `Qwen3VLMoeForConditionalGeneration` (autodetected by the node). |
| Qwen3VL8BInstruct ([HF](https://huggingface.co/huihui-ai)) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; dropin for the judge node. |
| Qwen3VL4BInstruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. |
**Gemma alternative:** Gemma327Bit (abliterated, 4bit ~16 GB) is a solid different
visual prior if you want a second opinion, but the Krea2 text encoder + PromptBuilder
are already Qwenaligned, so staying on Qwen3VL keeps the vocabulary consistent.
Download an upgrade and point the node's `model_path` at it:
```bash
hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
--local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4
```
Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always
coreside — run them as **separate queue steps** and let ComfyUI unload between; the loop
is sequential anyway. The 8B bf16 judge coresides more easily.
---
## 3. Scoring rubric (what the VLM actually returns)
The judge prompts Qwen3VL to return **strict JSON** with one overall score and a score
per axis, where the axes mirror what PromptBuilder can control. This is what makes the
diff *actionable* instead of generic prose.
```json
{
"overall_score": 0.0,
"axes": {
"cast": {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
"clothing": {"score": 0.0, "diff": "ref lingerie vs gen nude"},
"pose": {"score": 0.0, "diff": "ref standing vs gen seated"},
"scene": {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
"composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
"expression": {"score": 0.0, "diff": "ref smiling vs gen neutral"},
"color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
},
"fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
}
```
The axis list is **configurable** on the node so it can match whichever PromptBuilder
knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
map each one onto a knob.
### Reducing VLMasjudge variance (important)
VLM scoring is noisy and biased. Mitigations baked into the node / recommended:
1. **Positionbias swap** — run the judge twice with reference/generated order swapped and
average the peraxis scores (`swap_eval=True`). Cuts the "first image wins" bias.
2. **Low temperature** (0.00.3) + a **fixed rubric** in the system prompt → repeatable scores.
3. **Anchored 01 rubric** (0 = unrelated, 0.5 = same category/different details, 1 = nearidentical) so scores are comparable across iterations.
4. **Evidencefirst**: ask the model to state the concrete difference *before* the number; reasoningthenscore is measurably more reliable than scorethenreasoning.
5. **Average over k T2I seeds** for the *same* prompt if you want the score to reflect the prompt rather than sampler noise — or, cheaper, **freeze the T2I seed** during the axis search and only vary it once at the end.
---
## 4. The calibrator / controller
> **Chosen design: the controller is an external CLI agent, not an ingraph node.**
> The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into
> the `CalibratorPromptReceptor` node, and queues ComfyUI via its HTTP API — one
> `prompt_id` per iteration. See **[AGENT_LOOP.md](AGENT_LOOP.md)** and `agent_bridge.py`.
> The options below describe the *policy* the agent can run.
PromptBuilder is a **deterministic, seeded, combinatorial** generator (it is *not* an
LLM). So "calibration" = **searching the space of `(seed, profile, peraxis overrides)`**
to maximize `overall_score`. Three controller options, easiest → strongest:
1. **Greedy peraxis hillclimb (start here).**
For each axis with the lowest score, apply the matching `fix_suggestion` as a knob
override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change
if `overall_score` improved, else revert. Loop until ≥ target or no axis improves.
Implementable today with the PromptBuilder **ForLoop Start/End + Accumulator** nodes.
2. **Blackbox optimizer over the knob vector.**
Encode the exposed knobs as a parameter vector and drive it with Optuna / CMAES /
a simple bandit, objective = `overall_score`. Better for >34 interacting axes; needs
a thin Python controller node that holds state across iterations.
3. **LLMintheloop rewriter.**
Feed `diff_analysis` to a (local) text LLM that proposes the next knob settings (or,
if you move to freetext prompts, rewrites the prompt). Most flexible, least
reproducible — use the same abliterated Qwen3 text head to keep it local and uncensored.
**Loop hygiene:** fix resolution/sampler/steps across iterations; freeze T2I seed while
searching; stop on `overall_score ≥ target` (e.g. 0.85) **or** `max_iters`; log every
`(knobs, score, diff)` triple so the search is auditable and resumable.
---
## 5. Concrete build order
1. **Judge node** (this repo, `nodes/qwen_judge.py`) — load local Qwen3VL4B abliterated,
take ref+gen, output `overall_score (FLOAT)`, `axis_scores (JSON STRING)`,
`diff_analysis (STRING)`, `raw (STRING)`. ✅ scaffolded.
2. **Wire the loop** in a workflow: PromptBuilder → T2I → Judge → Accumulator, using the
SxCP ForLoop nodes; route `overall_score` into the loop's stop condition.
3. **Controller node** — start with greedy peraxis hillclimb that reads `diff_analysis`
and emits knob overrides back into PromptBuilder's split control nodes.
4. **Tune the judge** — calibrate the rubric on a handful of known ref/gen pairs; enable
`swap_eval`; pick temperature; decide if you need to step up to 8B/30BA3B.
See [README.md](../README.md) for install/usage of the Judge node.