# Local Prompt Calibrator: Methodology

> Goal: a **fully local** ComfyUI feedback loop where a vision‑language model (VLM)
> scores how close a *generated* image is to a *reference* image, and that score +
> a structured difference analysis is used to **calibrate the prompt‑generation
> method** ([ComfyUI‑Prompt‑Builder](../../ComfyUI-Prompt-Builder), the "SxCP" nodes)
> until the generated image matches the reference.

---

## 1. The loop at a glance

```
     ┌──────────────────────────────────────────────┐
     │ REFERENCE image (the target look)            │
     └───────────────┬──────────────────────────────┘
                     │
┌────────────────────▼────────────────┐      calibration deltas
│ Prompt-Builder (SxCP) ── "method"   │◄──── (axis nudges / knob
│ seeded pools + profile knobs        │       overrides / seed move)
└────────────────────┬────────────────┘
                     │ prompt + negative
┌────────────────────▼────────────────┐
│ T2I model (SDXL / Flux / Krea2)     │ ← fix the sampler seed while
└────────────────────┬────────────────┘   searching the prompt axes
                     │ generated image
┌────────────────────▼──────────────────────────────────┐
│ Qwen3-VL JUDGE node ── the "vllm node"                │
│ in : reference + generated                            │
│ out: overall_score 0..1                               │
│      per-axis {score, ref, gen} over ~20 axes         │
│      (identity, body, wardrobe, action, affect,       │
│       camera, render): target vs current values       │
│      (local model observes only; no fixes suggested)  │
└────────────────────┬──────────────────────────────────┘
                     │ score + ref/gen per axis
┌────────────────────▼────────────────┐
│ CALIBRATOR / controller             │
│ - accumulate per-axis scores        │
│ - map diffs → axis adjustments      │
│ - update Prompt-Builder knobs       │
│ - stop when overall_score ≥ target  │
│   or max iterations reached         │
└─────────────────────────────────────┘
```

The novel piece is the **Judge node**. Off‑the‑shelf Qwen‑VL nodes emit free text;
a calibrator needs a **machine‑readable score + per‑axis diffs** so the controller
can act on them. That is what `nodes/qwen_judge.py` in this repo provides.

---

## 2. The VLLM node: what to reuse

You already have the model converted locally:

```
/media/p5/qwen3vl_4b_abliterated_comfy_convert/
├── hf_bf16/   ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
└── hf_fp8/    ← same model, FP8 (≈4–5 GB, trivially fits the RTX 5090 32 GB)
```

The **abliterated** variant matters: stock Qwen3‑VL will often refuse to "describe or
analyze" adult imagery, which would break the loop. huihui‑ai removed the text‑side
refusal direction, so it scores NSFW reference/generated pairs without bailing.

### Reusable ComfyUI nodes (pick one as the plumbing base)

| Repo | Backend | Multi‑image | Local path | Notes |
|---|---|---|---|---|
| **[hardik-uppal/ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)** | transformers | ✅ `images` + `images_batch_2/3` | needs tiny tweak | **Best base**: built for "compare these images, describe the differences"; supports FP16 / 8‑bit / 4‑bit **and pre‑quantized FP8** (matches your `hf_fp8`). |
| [IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct](https://github.com/IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct) | transformers | ✅ multi‑image query | HF download | Clean native Qwen3‑VL‑Instruct integration. |
| [jren712/ComfyUI-QwenVL-abliterated](https://github.com/jren712/ComfyUI-QwenVL-abliterated) | transformers | ✅ | abliterated‑oriented | Fork tuned for the abliterated weights. |
| [1038lab/ComfyUI-QwenVL](https://github.com/1038lab/ComfyUI-QwenVL) | **GGUF** (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. |

**Recommendation:** don't run any of them *as‑is* for the loop, because they only output
free text. Instead reuse their **model‑load + `apply_chat_template` + `generate`** plumbing
inside a purpose‑built **Judge node** (this repo) that forces structured JSON output. The
`ComfyUI-QwenVL-MultiImage` loader is the closest template (it already handles two
image batches + FP8).

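Even with a strict prompt, a VLM may wrap its JSON in extra prose, so the Judge node needs a tolerant parse step. The following is a minimal sketch of one way to do that, not the code in `nodes/qwen_judge.py`; the function name `extract_first_json` is hypothetical, and it only relies on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def extract_first_json(raw: str) -> dict:
    """Return the first parseable JSON object embedded in free-form model output."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not open valid JSON; try the next one.
            start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Hypothetical raw output: chatter, a stray brace, then the actual payload.
raw_output = 'Analysis {draft} follows: {"overall_score": 0.5, "axes": {}} Done.'
print(extract_first_json(raw_output))
```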
### Model sizing on 32 GB (RTX 5090): abliterated, latest Qwen VL

As of June 2026 the **latest Qwen VL family with abliterated vision builds is Qwen3‑VL**
(Qwen3.5‑VL shipped in early 2026, but abliterated builds of it are **text‑only so far**,
with no uncensored Qwen3.5‑*VL* yet). So "latest + uncensored + fits 32 GB" =
**Qwen3‑VL‑30B‑A3B abliterated**. All rows below are huihui‑ai abliterated (uncensored) weights:

| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict |
|---|---|---|---|
| **Qwen3‑VL‑30B‑A3B‑Instruct** ([HF](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated)) | **nf4 (4‑bit)** or GGUF Q4_K_M | ~18 GB | **Best judge that fits.** MoE → only 3B active, so it's fast despite 30B total. transformers class `Qwen3VLMoeForConditionalGeneration` (auto‑detected by the node). |
| Qwen3‑VL‑8B‑Instruct ([HF](https://huggingface.co/huihui-ai)) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; drop‑in for the judge node. |
| Qwen3‑VL‑4B‑Instruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. |

**Gemma alternative:** Gemma‑3‑27B‑it (abliterated, 4‑bit ~16 GB) offers a different
visual prior if you want a second opinion, but the Krea2 text encoder + Prompt‑Builder
are already Qwen‑aligned, so staying on Qwen3‑VL keeps the vocabulary consistent.

Download an upgrade and point the node's `model_path` at it:

```bash
hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
  --local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4
```

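As a rough sanity check on the ~VRAM column in the sizing table, the weights‑only footprint is about parameter count times bits per weight. The sketch below uses nominal parameter counts (30B, 8B, 4B) and a hypothetical helper `weights_gb`. It ignores quantization metadata, the KV cache, and activations, which is one plausible reason the table's figures sit a few GB higher.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in decimal GB; excludes KV cache, activations and quantization metadata."""
    return params_billion * bits_per_weight / 8


# Nominal parameter counts; real checkpoints differ slightly (assumption).
for name, params, bits in [
    ("30B-A3B @ nf4", 30, 4),   # 15.0 GB of weights
    ("8B @ bf16", 8, 16),       # 16.0 GB
    ("4B @ fp8", 4, 8),         # 4.0 GB
    ("4B @ bf16", 4, 16),       # 8.0 GB
]:
    print(f"{name}: ~{weights_gb(params, bits):.1f} GB weights")
```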
Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always
co‑reside, so run them as **separate queue steps** and let ComfyUI unload models between
them; the loop is sequential anyway. The 8B bf16 judge co‑resides more easily.

---

## 3. Scoring rubric (what the VLM actually returns)

The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and, per axis,
the **target value (`ref`), the current value (`gen`), and the closeness (`score`, where 1
means the two match)**: the *target / current / distance* an agent needs to calibrate.
The local model only observes; it suggests no fixes (a stronger external model owns correction).

```json
{
  "overall_score": 0.0,
  "axes": {
    "subject_count":  {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
    "position":       {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
    "clothing_state": {"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
    "scene":          {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
    "framing":        {"score": 0.6, "ref": "full body", "gen": "close-up"},
    "lighting_color": {"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
  }
}
```

The axis list is **configurable** on the node. The default ~20 axes are grouped as
identity / body / wardrobe / action / affect / camera / render, kept granular so the
*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
stays discriminative for explicit content. For each low‑scoring axis, the agent steers the
prompt wording toward that axis's `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).

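To make the "steer each low axis toward its `ref`" step concrete, here is a minimal sketch of how an agent might rank axes from the judge's JSON, worst first. The function name `worst_first` and the 0.7 threshold are illustrative assumptions, not part of the node.

```python
import json


def worst_first(judge_json: str, threshold: float = 0.7) -> list[tuple[str, float, str, str]]:
    """Return (axis, score, ref, gen) for axes scoring below `threshold`, lowest score first."""
    axes = json.loads(judge_json)["axes"]
    rows = [(name, float(a["score"]), a["ref"], a["gen"]) for name, a in axes.items()]
    # sorted() is stable, so ties keep the order in which the judge listed them.
    return sorted((r for r in rows if r[1] < threshold), key=lambda r: r[1])


# Sample payload in the shape shown above (values are illustrative).
sample = json.dumps({
    "overall_score": 0.0,
    "axes": {
        "subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
        "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
        "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
        "lighting_color": {"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"},
    },
})

for axis, score, ref, gen in worst_first(sample):
    print(f"{axis}: {score:.2f}  target={ref!r}  current={gen!r}")
```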
### Reducing VLM‑as‑judge variance (important)

VLM scoring is noisy and biased. Mitigations baked into the node / recommended:

1. **Position‑bias swap**: run the judge twice with reference/generated order swapped and
   average the per‑axis scores (`swap_eval=True`). This reduces the "first image wins" bias.
2. **Low temperature** (0.0–0.3) + a **fixed rubric** in the system prompt → repeatable scores.
3. **Anchored 0–1 rubric** (0 = unrelated, 0.5 = same category/different details, 1 = near‑identical) so scores are comparable across iterations.
4. **Evidence‑first**: ask the model to state the concrete difference *before* the number; reasoning‑then‑score tends to be more reliable than score‑then‑reasoning.
5. **Average over k T2I seeds** for the *same* prompt if you want the score to reflect the prompt rather than sampler noise; or, cheaper, **freeze the T2I seed** during the axis search and only vary it once at the end.

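The position‑bias swap in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the node's implementation: both passes are assumed to return the per‑axis dict shape from section 3, and in the swapped pass the model's `ref`/`gen` labels are assumed to be inverted (its "ref" describes the generated image), so they are flipped back before merging. The names `unswap` and `average_swapped` are hypothetical.

```python
def unswap(axes: dict) -> dict:
    """Flip ref/gen labels of a pass that was run with the two images in swapped order."""
    return {
        name: {"score": a["score"], "ref": a["gen"], "gen": a["ref"]}
        for name, a in axes.items()
    }


def average_swapped(forward: dict, swapped: dict) -> dict:
    """Average per-axis scores of the forward pass and the swapped pass."""
    restored = unswap(swapped)
    merged = {}
    for name, fwd in forward.items():
        other = restored.get(name)
        # If the swapped pass dropped an axis, fall back to the forward score alone.
        score = fwd["score"] if other is None else (fwd["score"] + other["score"]) / 2
        merged[name] = {"score": score, "ref": fwd["ref"], "gen": fwd["gen"]}
    return merged


forward_pass = {"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"}}
swapped_pass = {"framing": {"score": 0.4, "ref": "close-up", "gen": "full body"}}
print(average_swapped(forward_pass, swapped_pass))
```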
---

## 4. The calibrator / controller

> **Chosen design: the controller is an external CLI agent, not an in‑graph node.**
> The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into
> the `CalibratorPromptReceptor` node, and queues ComfyUI via its HTTP API, with one
> `prompt_id` per iteration. See **[AGENT_LOOP.md](AGENT_LOOP.md)** and `agent_bridge.py`.
> The options below describe the *policy* the agent can run.

Prompt‑Builder is a **deterministic, seeded, combinatorial** generator (it is *not* an
LLM). So "calibration" means **searching the space of `(seed, profile, per‑axis overrides)`**
to maximize `overall_score`. Three controller options, from easiest to strongest:

1. **Greedy per‑axis hill‑climb (start here).**
   Take the lowest‑scoring axis, rewrite that axis's prompt wording toward its `ref`
   (target) value, regenerate, keep the change if `overall_score` improved, else revert.
   Loop until `overall_score` ≥ target or no axis improves. The agent decides the wording
   (the judge suggests no fixes). Implementable with the Prompt‑Builder
   **For‑Loop Start/End + Accumulator** nodes.

2. **Black‑box optimizer over the knob vector.**
   Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES /
   a simple bandit, objective = `overall_score`. Better for >3–4 interacting axes; needs
   a thin Python controller node that holds state across iterations.

3. **LLM‑in‑the‑loop rewriter.**
   Feed `diff_analysis` to a (local) text LLM that proposes the next knob settings (or,
   if you move to free‑text prompts, rewrites the prompt). This is the most flexible and
   least reproducible option. Use the same abliterated Qwen3 text head to keep it local
   and uncensored.

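Option 1 can be sketched as a small loop. This is a minimal illustration, not the agent in `agent_bridge.py`: `evaluate` stands in for "generate an image and run the Judge" and `reword` stands in for the agent's wording decision. Both are hypothetical callables supplied by the caller, and the prompt is modelled as a plain axis‑to‑wording dict.

```python
from typing import Callable

Axes = dict[str, dict]  # axis name -> {"score": float, "ref": str, "gen": str}


def hill_climb(
    prompt: dict[str, str],
    evaluate: Callable[[dict[str, str]], tuple[float, Axes]],
    reword: Callable[[str, str, str], str],
    target: float = 0.85,
    max_iters: int = 20,
) -> tuple[dict[str, str], float, list[dict]]:
    """Greedy per-axis search: change the worst untried axis, keep the change only if it helps."""
    best_score, axes = evaluate(prompt)
    history = [{"prompt": dict(prompt), "score": best_score}]
    tried: set[str] = set()  # axes whose last rewrite did not improve the score
    for _ in range(max_iters):
        if best_score >= target:
            break
        ranked = sorted(axes, key=lambda a: axes[a]["score"])
        candidates = [a for a in ranked if a in prompt and a not in tried]
        if not candidates:
            break  # no axis left that might improve
        axis = candidates[0]
        trial = dict(prompt)
        trial[axis] = reword(axis, prompt[axis], axes[axis]["ref"])
        score, trial_axes = evaluate(trial)
        history.append({"prompt": trial, "score": score})
        if score > best_score:
            prompt, best_score, axes = trial, score, trial_axes
            tried.clear()  # a new baseline may make earlier axes worth retrying
        else:
            tried.add(axis)  # revert: keep the old prompt
    return prompt, best_score, history


# Toy stand-ins so the sketch runs without ComfyUI: the "judge" compares wording to a target.
TARGET = {"scene": "dim bedroom", "framing": "full body"}


def toy_evaluate(p: dict[str, str]) -> tuple[float, Axes]:
    axes = {a: {"score": 1.0 if p[a] == TARGET[a] else 0.3, "ref": TARGET[a], "gen": p[a]} for a in p}
    return sum(v["score"] for v in axes.values()) / len(axes), axes


final, final_score, log = hill_climb(
    {"scene": "outdoor", "framing": "close-up"},
    toy_evaluate,
    lambda axis, current, ref: ref,  # toy agent: adopt the target wording verbatim
)
print(final, final_score, len(log))
```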
**Loop hygiene:** fix resolution/sampler/steps across iterations; freeze T2I seed while
searching; stop on `overall_score ≥ target` (e.g. 0.85) **or** `max_iters`; log every
`(knobs, score, diff)` triple so the search is auditable and resumable.

---

## 5. Concrete build order

1. **Judge node** (this repo, `nodes/qwen_judge.py`): load local Qwen3‑VL‑4B abliterated,
   take ref+gen, output `overall_score (FLOAT)`, `axis_scores (JSON STRING)`,
   `diff_analysis (STRING)`, `raw (STRING)`. ✅ scaffolded.
2. **Wire the loop** in a workflow: Prompt‑Builder → T2I → Judge → Accumulator, using the
   SxCP For‑Loop nodes; route `overall_score` into the loop's stop condition.
3. **Controller node**: start with greedy per‑axis hill‑climb that reads `diff_analysis`
   and emits knob overrides back into Prompt‑Builder's split control nodes.
4. **Tune the judge**: calibrate the rubric on a handful of known ref/gen pairs; enable
   `swap_eval`; pick temperature; decide if you need to step up to 8B/30B‑A3B.

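For step 1, the four outputs map onto ComfyUI's custom‑node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `RETURN_NAMES`, `FUNCTION`, `NODE_CLASS_MAPPINGS`) roughly as in the skeleton below. It is a hypothetical sketch, not the contents of `nodes/qwen_judge.py`: the class name, the input list, and the `_run_vlm` placeholder (which returns canned JSON instead of calling Qwen3‑VL) are all assumptions.

```python
import json


class QwenJudgeSketch:
    """Hypothetical skeleton of a Judge node; the model call is stubbed out."""

    CATEGORY = "calibrator"
    FUNCTION = "judge"
    RETURN_TYPES = ("FLOAT", "STRING", "STRING", "STRING")
    RETURN_NAMES = ("overall_score", "axis_scores", "diff_analysis", "raw")

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "reference": ("IMAGE",),
                "generated": ("IMAGE",),
                "model_path": ("STRING", {"default": ""}),
                "swap_eval": ("BOOLEAN", {"default": True}),
            }
        }

    def judge(self, reference, generated, model_path, swap_eval):
        raw = self._run_vlm(reference, generated, model_path, swap_eval)
        data = json.loads(raw)
        axes = data.get("axes", {})
        # Human-readable summary, lowest-scoring axis first.
        worst = sorted(axes.items(), key=lambda kv: kv[1]["score"])
        diff = "\n".join(
            f"{name}: {a['score']:.2f} ref={a['ref']!r} gen={a['gen']!r}" for name, a in worst
        )
        return (float(data.get("overall_score", 0.0)), json.dumps(axes), diff, raw)

    def _run_vlm(self, reference, generated, model_path, swap_eval):
        # Placeholder: a real node would load the model and call generate() here.
        return json.dumps({
            "overall_score": 0.55,
            "axes": {
                "framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
                "scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
            },
        })


# ComfyUI discovers nodes through this mapping in the package's __init__.py.
NODE_CLASS_MAPPINGS = {"QwenJudgeSketch": QwenJudgeSketch}
```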
See [README.md](../README.md) for install/usage of the Judge node.