Files
ComfyUI-Prompt-Calibrator/docs/METHODOLOGY.md
T
Ethanfel 959ec70065 Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions
The local VLM now only observes and scores; correction is left to the stronger
external agent. Each axis reports the target value (ref), the current value (gen)
and the closeness (score) — the target/current/distance an agent needs to
calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/
camera/render) so the action cluster stays discriminative for explicit content.
swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first;
default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:52:40 +02:00

199 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Local Prompt Calibrator — Methodology
> Goal: a **fully local** ComfyUI feedback loop where a visionlanguage model (VLM)
> scores how close a *generated* image is to a *reference* image, and that score +
> a structured difference analysis is used to **calibrate the promptgeneration
> method** ([ComfyUIPromptBuilder](../../ComfyUI-Prompt-Builder), the "SxCP" nodes)
> until the generated image matches the reference.
---
## 1. The loop at a glance
```
┌──────────────────────────────────────────────┐
│ REFERENCE image (the target look) │
└───────────────┬──────────────────────────────┘
┌────────────────────▼────────────────┐ calibration deltas
│ Prompt-Builder (SxCP) ── "method" │◄──── (axis nudges / knob
│ seeded pools + profile knobs │ overrides / seed move)
└────────────────────┬────────────────┘
│ prompt + negative
┌────────────────────▼────────────────┐
│ T2I model (SDXL / Flux / Krea2) │ ← fix the sampler seed while
└────────────────────┬────────────────┘ searching the prompt axes
│ generated image
┌────────────────────▼──────────────────────────────────┐
│ Qwen3-VL JUDGE node ── the "vllm node" │
│ in : reference + generated │
│ out: overall_score 0..1 │
│ per-axis {score, ref, gen} over ~20 axes │
│ (identity, body, wardrobe, action, affect, │
│ camera, render) — target vs current values │
│ (local model observes only; no fixes suggested) │
└────────────────────┬──────────────────────────────────┘
│ score + ref/gen per axis
┌────────────────────▼────────────────┐
│ CALIBRATOR / controller │
│ - accumulate per-axis scores │
│ - map diffs → axis adjustments │
│ - update Prompt-Builder knobs │
│ - stop when overall_score ≥ target │
│ or max iterations reached │
└──────────────────────────────────────┘
```
The novel piece is the **Judge node**. Offtheshelf QwenVL nodes emit free text;
a calibrator needs a **machinereadable score + peraxis diffs** so the controller
can act on them. That is what `nodes/qwen_judge.py` in this repo provides.
---
## 2. The VLLM node — what to reuse
You already have the model converted locally:
```
/media/p5/qwen3vl_4b_abliterated_comfy_convert/
├── hf_bf16/ ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
└── hf_fp8/ ← same model, FP8 (≈45 GB, trivially fits the RTX 5090 32 GB)
```
The **abliterated** variant matters: stock Qwen3VL will often refuse to "describe or
analyze" adult imagery, which would break the loop. huihuiai removed the textside
refusal direction, so it scores NSFW reference/generated pairs without bailing.
### Reusable ComfyUI nodes (pick one as the plumbing base)
| Repo | Backend | Multiimage | Local path | Notes |
|---|---|---|---|---|
| **[hardik-uppal/ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)** | transformers | ✅ `images` + `images_batch_2/3` | needs tiny tweak | **Best base** — built for "compare these images, describe the differences"; supports FP16 / 8bit / 4bit **and prequantized FP8** (matches your `hf_fp8`). |
| [IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct](https://github.com/IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct) | transformers | ✅ multiimage query | HF download | Clean native Qwen3VLInstruct integration. |
| [jren712/ComfyUI-QwenVL-abliterated](https://github.com/jren712/ComfyUI-QwenVL-abliterated) | transformers | ✅ | abliteratedoriented | Fork tuned for the abliterated weights. |
| [1038lab/ComfyUI-QwenVL](https://github.com/1038lab/ComfyUI-QwenVL) | **GGUF** (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. |
**Recommendation:** don't run any of them *asis* for the loop — they only output text.
Instead reuse their **modelload + `apply_chat_template` + `generate`** plumbing inside
a purposebuilt **Judge node** (this repo) that forces structured JSON output. The
`ComfyUI-QwenVL-MultiImage` loader is the closest template (it already handles two
image batches + FP8).
### Model sizing on 32 GB (RTX 5090) — abliterated, latest Qwen VL
As of June 2026 the **latest Qwen VL family is Qwen3VL** (Qwen3.5VL shipped early
2026, but abliterated builds of it are **textonly so far** — no uncensored
Qwen3.5*VL* yet). So "latest + uncensored + fits 32 GB" = **Qwen3VL30BA3B abliterated**.
All rows below are huihuiai abliterated (uncensored) weights:
| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict |
|---|---|---|---|
| **Qwen3VL30BA3BInstruct** ([HF](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated)) | **nf4 (4bit)** or GGUF Q4_K_M | ~18 GB | **Best judge that fits.** MoE → only 3B active, so it's fast despite 30B total. transformers class `Qwen3VLMoeForConditionalGeneration` (autodetected by the node). |
| Qwen3VL8BInstruct ([HF](https://huggingface.co/huihui-ai)) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; dropin for the judge node. |
| Qwen3VL4BInstruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. |
**Gemma alternative:** Gemma327Bit (abliterated, 4bit ~16 GB) is a solid different
visual prior if you want a second opinion, but the Krea2 text encoder + PromptBuilder
are already Qwenaligned, so staying on Qwen3VL keeps the vocabulary consistent.
Download an upgrade and point the node's `model_path` at it:
```bash
hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
--local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4
```
Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always
coreside — run them as **separate queue steps** and let ComfyUI unload between; the loop
is sequential anyway. The 8B bf16 judge coresides more easily.
---
## 3. Scoring rubric (what the VLM actually returns)
The judge prompts Qwen3VL to return **strict JSON** with one overall score and, per axis,
the **target value (`ref`), the current value (`gen`), and the gap (`score`)** — exactly
the *target / current / distance* an agent needs to calibrate. The local model only
observes; it suggests no fixes (a stronger external model owns correction).
```json
{
"overall_score": 0.0,
"axes": {
"subject_count": {"score": 1.0, "ref": "1 woman", "gen": "1 woman"},
"position": {"score": 0.3, "ref": "doggy style", "gen": "missionary"},
"clothing_state":{"score": 0.4, "ref": "red lace lingerie", "gen": "nude"},
"scene": {"score": 0.5, "ref": "dim bedroom", "gen": "outdoor"},
"framing": {"score": 0.6, "ref": "full body", "gen": "close-up"},
"lighting_color":{"score": 0.5, "ref": "warm low-key", "gen": "flat daylight"}
}
}
```
The axis list is **configurable** on the node. The default ~20 axes are grouped as
identity / body / wardrobe / action / affect / camera / render, kept granular so the
*action* cluster (`sexual_act`, `position`, `penetration`, `explicitness`, `body_contact`)
stays discriminative for explicit content. The agent steers each low axis's prompt wording
toward its `ref` value. See [CALIBRATION_POLICY.md](CALIBRATION_POLICY.md).
### Reducing VLMasjudge variance (important)
VLM scoring is noisy and biased. Mitigations baked into the node / recommended:
1. **Positionbias swap** — run the judge twice with reference/generated order swapped and
average the peraxis scores (`swap_eval=True`). Cuts the "first image wins" bias.
2. **Low temperature** (0.00.3) + a **fixed rubric** in the system prompt → repeatable scores.
3. **Anchored 01 rubric** (0 = unrelated, 0.5 = same category/different details, 1 = nearidentical) so scores are comparable across iterations.
4. **Evidencefirst**: ask the model to state the concrete difference *before* the number; reasoningthenscore is measurably more reliable than scorethenreasoning.
5. **Average over k T2I seeds** for the *same* prompt if you want the score to reflect the prompt rather than sampler noise — or, cheaper, **freeze the T2I seed** during the axis search and only vary it once at the end.
---
## 4. The calibrator / controller
> **Chosen design: the controller is an external CLI agent, not an ingraph node.**
> The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into
> the `CalibratorPromptReceptor` node, and queues ComfyUI via its HTTP API — one
> `prompt_id` per iteration. See **[AGENT_LOOP.md](AGENT_LOOP.md)** and `agent_bridge.py`.
> The options below describe the *policy* the agent can run.
PromptBuilder is a **deterministic, seeded, combinatorial** generator (it is *not* an
LLM). So "calibration" = **searching the space of `(seed, profile, peraxis overrides)`**
to maximize `overall_score`. Three controller options, easiest → strongest:
1. **Greedy peraxis hillclimb (start here).**
Take the lowestscoring axis, rewrite that axis's prompt wording toward its `ref`
(target) value, regenerate, keep the change if `overall_score` improved, else revert.
Loop until ≥ target or no axis improves. The agent decides the wording (no machine
fixes). Implementable with the PromptBuilder **ForLoop Start/End + Accumulator** nodes.
2. **Blackbox optimizer over the knob vector.**
Encode the exposed knobs as a parameter vector and drive it with Optuna / CMAES /
a simple bandit, objective = `overall_score`. Better for >34 interacting axes; needs
a thin Python controller node that holds state across iterations.
3. **LLMintheloop rewriter.**
Feed `diff_analysis` to a (local) text LLM that proposes the next knob settings (or,
if you move to freetext prompts, rewrites the prompt). Most flexible, least
reproducible — use the same abliterated Qwen3 text head to keep it local and uncensored.
**Loop hygiene:** fix resolution/sampler/steps across iterations; freeze T2I seed while
searching; stop on `overall_score ≥ target` (e.g. 0.85) **or** `max_iters`; log every
`(knobs, score, diff)` triple so the search is auditable and resumable.
---
## 5. Concrete build order
1. **Judge node** (this repo, `nodes/qwen_judge.py`) — load local Qwen3VL4B abliterated,
take ref+gen, output `overall_score (FLOAT)`, `axis_scores (JSON STRING)`,
`diff_analysis (STRING)`, `raw (STRING)`. ✅ scaffolded.
2. **Wire the loop** in a workflow: PromptBuilder → T2I → Judge → Accumulator, using the
SxCP ForLoop nodes; route `overall_score` into the loop's stop condition.
3. **Controller node** — start with greedy peraxis hillclimb that reads `diff_analysis`
and emits knob overrides back into PromptBuilder's split control nodes.
4. **Tune the judge** — calibrate the rubric on a handful of known ref/gen pairs; enable
`swap_eval`; pick temperature; decide if you need to step up to 8B/30BA3B.
See [README.md](../README.md) for install/usage of the Judge node.