Files
ComfyUI-Prompt-Calibrator/docs/METHODOLOGY.md
T
Ethanfel 95198a15b5 Initial commit: VLM-as-judge prompt calibration loop
Qwen3-VL image-similarity judge node, external-prompt receptor node,
agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/
calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:15:56 +02:00

12 KiB
Raw Blame History

Local Prompt Calibrator — Methodology

Goal: a fully local ComfyUI feedback loop where a visionlanguage model (VLM) scores how close a generated image is to a reference image, and that score + a structured difference analysis is used to calibrate the promptgeneration method (ComfyUIPromptBuilder, the "SxCP" nodes) until the generated image matches the reference.


1. The loop at a glance

        ┌──────────────────────────────────────────────┐
        │  REFERENCE image  (the target look)           │
        └───────────────┬──────────────────────────────┘
                        │
   ┌────────────────────▼────────────────┐   calibration deltas
   │ Prompt-Builder (SxCP)  ── "method"  │◄──── (axis nudges / knob
   │  seeded pools + profile knobs       │       overrides / seed move)
   └────────────────────┬────────────────┘
                        │ prompt + negative
   ┌────────────────────▼────────────────┐
   │ T2I model  (SDXL / Flux / Krea2)    │   ← fix the sampler seed while
   └────────────────────┬────────────────┘     searching the prompt axes
                        │ generated image
   ┌────────────────────▼──────────────────────────────────┐
   │ Qwen3-VL JUDGE node  ── the "vllm node"                │
   │  in : reference + generated                            │
   │  out: overall_score 0..1                               │
   │       per-axis scores  (cast, clothing, pose, scene,   │
   │         composition, expression, color/lighting)       │
   │       diff_analysis (JSON: what's off + how to fix,    │
   │         phrased in Prompt-Builder axis vocabulary)     │
   └────────────────────┬──────────────────────────────────┘
                        │ score + diffs
   ┌────────────────────▼────────────────┐
   │ CALIBRATOR / controller             │
   │  - accumulate per-axis scores        │
   │  - map diffs → axis adjustments      │
   │  - update Prompt-Builder knobs        │
   │  - stop when overall_score ≥ target   │
   │    or max iterations reached          │
   └──────────────────────────────────────┘

The novel piece is the Judge node. Offtheshelf QwenVL nodes emit free text; a calibrator needs a machinereadable score + peraxis diffs so the controller can act on them. That is what nodes/qwen_judge.py in this repo provides.


2. The VLLM node — what to reuse

You already have the model converted locally:

/media/p5/qwen3vl_4b_abliterated_comfy_convert/
  ├── hf_bf16/   ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
  └── hf_fp8/    ← same model, FP8 (≈45 GB, trivially fits the RTX 5090 32 GB)

The abliterated variant matters: stock Qwen3VL will often refuse to "describe or analyze" adult imagery, which would break the loop. huihuiai removed the textside refusal direction, so it scores NSFW reference/generated pairs without bailing.

Reusable ComfyUI nodes (pick one as the plumbing base)

Repo Backend Multiimage Local path Notes
hardik-uppal/ComfyUI-QwenVL-MultiImage transformers images + images_batch_2/3 needs tiny tweak Best base — built for "compare these images, describe the differences"; supports FP16 / 8bit / 4bit and prequantized FP8 (matches your hf_fp8).
IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct transformers multiimage query HF download Clean native Qwen3VLInstruct integration.
jren712/ComfyUI-QwenVL-abliterated transformers abliteratedoriented Fork tuned for the abliterated weights.
1038lab/ComfyUI-QwenVL GGUF (llama.cpp) local GGUF Use only if you want GGUF; bf16 4B on 32 GB doesn't need it.

Recommendation: don't run any of them asis for the loop — they only output text. Instead reuse their modelload + apply_chat_template + generate plumbing inside a purposebuilt Judge node (this repo) that forces structured JSON output. The ComfyUI-QwenVL-MultiImage loader is the closest template (it already handles two image batches + FP8).

Model sizing on 32 GB (RTX 5090) — abliterated, latest Qwen VL

As of June 2026 the latest Qwen VL family is Qwen3VL (Qwen3.5VL shipped early 2026, but abliterated builds of it are textonly so far — no uncensored Qwen3.5VL yet). So "latest + uncensored + fits 32 GB" = Qwen3VL30BA3B abliterated. All rows below are huihuiai abliterated (uncensored) weights:

Model (abliterated) Best precision on 32 GB ~VRAM Verdict
Qwen3VL30BA3BInstruct (HF) nf4 (4bit) or GGUF Q4_K_M ~18 GB Best judge that fits. MoE → only 3B active, so it's fast despite 30B total. transformers class Qwen3VLMoeForConditionalGeneration (autodetected by the node).
Qwen3VL8BInstruct (HF) bf16 ~17 GB Easy middle ground, no quantization. Clearly better than 4B; dropin for the judge node.
Qwen3VL4BInstruct (already local) fp8 / bf16 ~5 / ~9 GB Lightweight fallback / fast iteration.

Gemma alternative: Gemma327Bit (abliterated, 4bit ~16 GB) is a solid different visual prior if you want a second opinion, but the Krea2 text encoder + PromptBuilder are already Qwenaligned, so staying on Qwen3VL keeps the vocabulary consistent.

Download an upgrade and point the node's model_path at it:

hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
  --local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4

Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always coreside — run them as separate queue steps and let ComfyUI unload between; the loop is sequential anyway. The 8B bf16 judge coresides more easily.


3. Scoring rubric (what the VLM actually returns)

The judge prompts Qwen3VL to return strict JSON with one overall score and a score per axis, where the axes mirror what PromptBuilder can control. This is what makes the diff actionable instead of generic prose.

{
  "overall_score": 0.0,
  "axes": {
    "cast":        {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
    "clothing":    {"score": 0.0, "diff": "ref lingerie vs gen nude"},
    "pose":        {"score": 0.0, "diff": "ref standing vs gen seated"},
    "scene":       {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
    "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
    "expression":  {"score": 0.0, "diff": "ref smiling vs gen neutral"},
    "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
  },
  "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
}

The axis list is configurable on the node so it can match whichever PromptBuilder knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression, color/lighting). fix_suggestions is phrased in axis vocabulary so the controller can map each one onto a knob.

Reducing VLMasjudge variance (important)

VLM scoring is noisy and biased. Mitigations baked into the node / recommended:

  1. Positionbias swap — run the judge twice with reference/generated order swapped and average the peraxis scores (swap_eval=True). Cuts the "first image wins" bias.
  2. Low temperature (0.00.3) + a fixed rubric in the system prompt → repeatable scores.
  3. Anchored 01 rubric (0 = unrelated, 0.5 = same category/different details, 1 = nearidentical) so scores are comparable across iterations.
  4. Evidencefirst: ask the model to state the concrete difference before the number; reasoningthenscore is measurably more reliable than scorethenreasoning.
  5. Average over k T2I seeds for the same prompt if you want the score to reflect the prompt rather than sampler noise — or, cheaper, freeze the T2I seed during the axis search and only vary it once at the end.

4. The calibrator / controller

Chosen design: the controller is an external CLI agent, not an ingraph node. The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into the CalibratorPromptReceptor node, and queues ComfyUI via its HTTP API — one prompt_id per iteration. See AGENT_LOOP.md and agent_bridge.py. The options below describe the policy the agent can run.

PromptBuilder is a deterministic, seeded, combinatorial generator (it is not an LLM). So "calibration" = searching the space of (seed, profile, peraxis overrides) to maximize overall_score. Three controller options, easiest → strongest:

  1. Greedy peraxis hillclimb (start here). For each axis with the lowest score, apply the matching fix_suggestion as a knob override (e.g. set clothing=lingerie, cast_women=1), regenerate, keep the change if overall_score improved, else revert. Loop until ≥ target or no axis improves. Implementable today with the PromptBuilder ForLoop Start/End + Accumulator nodes.

  2. Blackbox optimizer over the knob vector. Encode the exposed knobs as a parameter vector and drive it with Optuna / CMAES / a simple bandit, objective = overall_score. Better for >34 interacting axes; needs a thin Python controller node that holds state across iterations.

  3. LLMintheloop rewriter. Feed diff_analysis to a (local) text LLM that proposes the next knob settings (or, if you move to freetext prompts, rewrites the prompt). Most flexible, least reproducible — use the same abliterated Qwen3 text head to keep it local and uncensored.

Loop hygiene: fix resolution/sampler/steps across iterations; freeze T2I seed while searching; stop on overall_score ≥ target (e.g. 0.85) or max_iters; log every (knobs, score, diff) triple so the search is auditable and resumable.


5. Concrete build order

  1. Judge node (this repo, nodes/qwen_judge.py) — load local Qwen3VL4B abliterated, take ref+gen, output overall_score (FLOAT), axis_scores (JSON STRING), diff_analysis (STRING), raw (STRING). scaffolded.
  2. Wire the loop in a workflow: PromptBuilder → T2I → Judge → Accumulator, using the SxCP ForLoop nodes; route overall_score into the loop's stop condition.
  3. Controller node — start with greedy peraxis hillclimb that reads diff_analysis and emits knob overrides back into PromptBuilder's split control nodes.
  4. Tune the judge — calibrate the rubric on a handful of known ref/gen pairs; enable swap_eval; pick temperature; decide if you need to step up to 8B/30BA3B.

See README.md for install/usage of the Judge node.