
Ethanfel 959ec70065 Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger
external agent. Each axis reports the target value (ref), the current value (gen)
and the closeness (score) — the target/current/distance an agent needs to
calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/
camera/render) so the action cluster stays discriminative for explicit content.
swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first;
default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-26 22:52:40 +02:00

Name	Last commit	Date
docs	Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions	2026-06-26 22:52:40 +02:00
nodes	Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions	2026-06-26 22:52:40 +02:00
workflow	Initial commit: VLM-as-judge prompt calibration loop	2026-06-26 22:15:56 +02:00
__init__.py	Initial commit: VLM-as-judge prompt calibration loop	2026-06-26 22:15:56 +02:00
.gitignore	Initial commit: VLM-as-judge prompt calibration loop	2026-06-26 22:15:56 +02:00
agent_bridge.py	Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions	2026-06-26 22:52:40 +02:00
pyproject.toml	Initial commit: VLM-as-judge prompt calibration loop	2026-06-26 22:15:56 +02:00
README.md	Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions	2026-06-26 22:52:40 +02:00
requirements.txt	Initial commit: VLM-as-judge prompt calibration loop	2026-06-26 22:15:56 +02:00

README.md

ComfyUI-Prompt-Calibratror

A fully local prompt calibration loop for ComfyUI. A vision-language model (Qwen3-VL) judges how close a generated image is to a reference image and returns a structured score + per-axis difference analysis, which is used to calibrate the prompt-generation method (ComfyUI-Prompt-Builder) until the generated image matches the reference.

Full design rationale, controller options, and VLM-as-judge variance mitigations are in docs/METHODOLOGY.md. The controller is an external CLI agent that drives ComfyUI via its HTTP API — see docs/AGENT_LOOP.md.

Nodes & tools

Component	What it is
`Qwen3-VL Image Judge (Calibrator)`	scores generated vs reference, writes analysis to disk for the agent
`SxCP External Prompt (Receptor)`	stable injection point; the agent sets `prompt/negative/seed` here per queue
`agent_bridge.py`	one CLI call = one iteration (inject → `POST /prompt` → wait → print analysis JSON)
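
The inject step can be pictured as a small edit to the API-format workflow before it is posted. The sketch below is illustrative only: it assumes the Receptor appears in the workflow JSON under a `class_type` named `SxCPExternalPrompt` (a hypothetical identifier, the real one is defined by the node pack) and that its inputs are called `prompt`, `negative` and `seed` as in the table above.

```python
import copy

# Hypothetical class_type for the Receptor node; check the node pack for the real name.
RECEPTOR_CLASS_TYPE = "SxCPExternalPrompt"


def inject_prompt(workflow, prompt, negative="", seed=0, class_type=RECEPTOR_CLASS_TYPE):
    """Return a copy of an API-format workflow with the Receptor inputs overwritten."""
    patched = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    targets = [
        node for node in patched.values()
        if isinstance(node, dict) and node.get("class_type") == class_type
    ]
    if not targets:
        raise ValueError(f"no node with class_type {class_type!r} in workflow")
    for node in targets:
        node.setdefault("inputs", {}).update(
            {"prompt": prompt, "negative": negative, "seed": seed}
        )
    return patched


# Tiny stand-in workflow: node id -> {"class_type", "inputs"}.
workflow = {
    "1": {"class_type": RECEPTOR_CLASS_TYPE, "inputs": {"prompt": "", "negative": "", "seed": 0}},
    "2": {"class_type": "KSampler", "inputs": {"steps": 20}},
}
patched = inject_prompt(workflow, "portrait, warm light", negative="blurry", seed=42)
```

Matching on `class_type` rather than a fixed node id keeps the injection point stable when the workflow is re-exported and node ids change.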

The "VLM node": `Qwen3-VL Image Judge (Calibrator)`

The core node (nodes/qwen_judge.py). It reuses the standard transformers Qwen3-VL inference plumbing (same approach as ComfyUI-QwenVL-MultiImage — the recommended reuse base) but forces strict JSON output so an automated loop can act on it.

Inputs

name	type	default	notes
`reference_image`	IMAGE	—	the target
`generated_image`	IMAGE	—	the candidate to score
`model_path`	STRING	`/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16`	local dir, HF repo id (`org/name`), or alias (`30b-a3b` / `8b` / `4b`)
`precision`	bf16 / fp16 / fp8 / nf4	bf16	`nf4` = 4-bit (run the 30B judge on 32 GB); `fp8` with the `hf_fp8` copy
`axes`	STRING	~20 axes (identity, body, wardrobe, action, affect, camera, render)	scored axes; granular for explicit content. Edit to taste
`max_new_tokens`	INT	1024	cap on tokens generated per judge pass
`temperature`	FLOAT	0.0	0 = greedy/repeatable
`swap_eval`	BOOL	true	run twice with images swapped, average → cuts position bias
`keep_loaded`	BOOL	true	cache weights across loop iterations
`auto_download`	BOOL	true	if `model_path` is a repo id/alias and not local, fetch it from HF into `models/prompt_generator/`
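
As a rough illustration of what `swap_eval` does with the two passes, the sketch below averages the per-axis scores after un-swapping the second pass. In the swapped pass the judge saw the images in reverse order, so its `ref` text describes the generated image and its `gen` text describes the reference. The function names and the choice to keep the forward pass's descriptions are assumptions, not the node's actual code.

```python
def unswap(entry):
    """Undo the image swap: the swapped pass's ref/gen descriptions trade places."""
    return {"score": entry["score"], "ref": entry["gen"], "gen": entry["ref"]}


def merge_swap_passes(forward, swapped):
    """Average per-axis scores of the forward pass and the (un-swapped) second pass."""
    restored = {axis: unswap(entry) for axis, entry in swapped.items()}
    merged = {}
    for axis in sorted(set(forward) | set(restored)):
        fwd, back = forward.get(axis), restored.get(axis)
        if fwd is None or back is None:
            # Axis reported by only one pass: keep whatever is available.
            merged[axis] = dict(fwd or back)
            continue
        merged[axis] = {
            "score": (fwd["score"] + back["score"]) / 2,
            "ref": fwd["ref"],  # assumption: prefer the forward pass's wording
            "gen": fwd["gen"],
        }
    return merged


forward = {"pose": {"score": 0.8, "ref": "standing", "gen": "sitting"}}
swapped = {
    "pose": {"score": 0.6, "ref": "sitting", "gen": "standing"},
    "lighting": {"score": 0.5, "ref": "cool", "gen": "warm"},
}
merged = merge_swap_passes(forward, swapped)
```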

Auto-download: set model_path to 30b-a3b (alias) or any org/name repo id and leave auto_download on — the node snapshot-downloads it on first run (into ComfyUI's models/prompt_generator/<name>) and reuses the local copy afterward. Local paths and the default skip download entirely.
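
The resolution order described above (local path first, then cached copy, then download) might look roughly like this. It is a sketch under assumptions: only the `30b-a3b` repo id is named in this README, so the alias map is deliberately incomplete; using the last component of the repo id as `<name>` is a guess; and `resolve_model_path` is a hypothetical helper. `snapshot_download` is the real `huggingface_hub` function and is imported lazily so the sketch runs without it when a downloader is injected.

```python
from pathlib import Path

# Only this alias's repo id appears in the README; the 8b/4b targets are not spelled out.
ALIASES = {"30b-a3b": "huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated"}


def resolve_model_path(model_path, models_dir, auto_download=True, download=None):
    """Return a local directory for model_path, downloading once if allowed."""
    if Path(model_path).is_dir():
        return Path(model_path)  # local paths skip download entirely

    repo_id = ALIASES.get(model_path, model_path)
    # Assumption: <name> is the last component of the repo id.
    local_dir = Path(models_dir) / repo_id.split("/")[-1]
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return local_dir  # reuse the copy fetched on an earlier run

    if not auto_download:
        raise FileNotFoundError(f"{model_path!r} is not local and auto_download is off")

    if download is None:
        from huggingface_hub import snapshot_download  # lazy import
        download = snapshot_download
    download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

A call such as `resolve_model_path("30b-a3b", "models/prompt_generator")` would fetch the weights on first use and return the cached directory afterward.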

Outputs

name	type	use
`overall_score`	FLOAT 0..1	loop stop-condition / objective
`axis_scores_json`	STRING (JSON)	per-axis `{score, ref, gen}` — target vs current, for the agent
`diff_analysis`	STRING	readable summary, worst axes first (`score ref:[…] gen:[…]`)
`raw`	STRING	raw model output (both passes if `swap_eval`)
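
To show how an agent might consume these outputs, here is a minimal sketch that parses `axis_scores_json` and renders a worst-first summary in the `score ref:[…] gen:[…]` shape. It assumes the JSON is an object keyed by axis name; the exact schema, the axis names and the sample values below are illustrative, not taken from the node.

```python
import json


def summarize_axes(axis_scores_json, limit=None):
    """Return one line per axis, lowest score first."""
    axes = json.loads(axis_scores_json)
    ranked = sorted(axes.items(), key=lambda item: item[1]["score"])
    if limit is not None:
        ranked = ranked[:limit]
    return [
        f"{name}: {entry['score']:.2f} ref:[{entry['ref']}] gen:[{entry['gen']}]"
        for name, entry in ranked
    ]


# Made-up sample payload in the assumed shape.
sample = json.dumps({
    "pose": {"score": 0.9, "ref": "standing", "gen": "standing, slight lean"},
    "lighting": {"score": 0.4, "ref": "warm lamp light", "gen": "cool daylight"},
})
lines = summarize_axes(sample)
```

Sorting worst-first puts the axis the agent should correct next at the top, which is the point of reporting target (`ref`) and current (`gen`) side by side.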

Install

cd /media/p5/Comfyui/custom_nodes
ln -s /media/p5/ComfyUI-Prompt-Calibratror .     # or git clone
/media/p5/Comfyui/venv/bin/pip install -r /media/p5/ComfyUI-Prompt-Calibratror/requirements.txt

The node defaults to the huihui-ai Qwen3-VL-4B-Instruct abliterated weights already converted at /media/p5/qwen3vl_4b_abliterated_comfy_convert/ so it runs out of the box (the abliterated/uncensored variant won't refuse to analyze adult imagery, which would otherwise break the loop).

Recommended upgrade (latest Qwen VL + uncensored, fits 32 GB): huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated — MoE (3B active, fast), run at precision=nf4 (~18 GB). The node auto-detects the MoE class. An easier middle ground is the 8B abliterated at bf16 (~17 GB, no quantization). Qwen3.5-VL abliterated isn't out yet (Qwen3.5 abliterated builds are text-only so far); Gemma-3-27B-it abliterated (4-bit) is a viable non-Qwen alternative. See docs/METHODOLOGY.md.

Loop sketch

Prompt-Builder (SxCP) ──prompt──▶ T2I (SDXL/Flux/Krea2) ──image──▶ Qwen3-VL Image Judge
        ▲                                                                │
        └──────── knob overrides ◀── Controller ◀── overall_score + diff ┘

Use the Prompt-Builder For-Loop Start/End + Accumulator nodes to drive iterations and route overall_score into the stop condition. Controller options (greedy hill-climb → black-box optimizer → LLM-in-the-loop) are in the methodology doc.
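
The simplest controller option, greedy hill-climbing, can be sketched as follows. This is an illustration rather than the project's controller: `score_fn` stands in for "generate an image with these knobs and return `overall_score`", and the knob names and values are hypothetical.

```python
import random


def hill_climb(knobs, choices, score_fn, target=0.9, max_iters=20, rng=None):
    """Mutate one knob at a time and keep the change only if the score improves."""
    rng = rng or random.Random(0)
    best, best_score = dict(knobs), score_fn(knobs)
    for _ in range(max_iters):
        if best_score >= target:  # stop condition driven by overall_score
            break
        name = rng.choice(sorted(choices))
        candidate = dict(best, **{name: rng.choice(choices[name])})
        score = score_fn(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Stand-in objective: fraction of knobs that match a hidden target setting.
TARGET = {"lighting": "warm", "framing": "full body"}


def fake_score(knobs):
    return sum(knobs[k] == v for k, v in TARGET.items()) / len(TARGET)


start = {"lighting": "cool", "framing": "close-up"}
choices = {"lighting": ["warm", "cool"], "framing": ["full body", "close-up"]}
best, best_score = hill_climb(start, choices, fake_score, max_iters=50)
```

Because a candidate is accepted only on strict improvement, the score never decreases, but the search can stall on a plateau; that limitation is one reason to move to the other controller options.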

End-to-end loop

Run ComfyUI with --listen, install this node pack, put your reference at ComfyUI/input/reference.png.
Load workflow/workflow_api.json (SDXL waiIllustriousSDXL_v160 example — swap the checkpoint for Flux/Krea as needed).

Drive it from your agent following docs/CALIBRATION_POLICY.md:

python agent_bridge.py --workflow workflow/workflow_api.json \
  --prompt "1 woman, red lingerie, bedroom, full body, warm light" \
  --run-tag iter001 --analysis-dir /media/p5/Comfyui/output/calibrator

stdout = the analysis JSON → agent calibrates → next iteration.
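
An agent written in Python could wrap the same call like this. The flags are the ones shown in the command above; the helper names are hypothetical, and the sketch assumes stdout contains exactly one JSON document.

```python
import json
import subprocess
import sys


def build_bridge_command(workflow, prompt, run_tag, analysis_dir, script="agent_bridge.py"):
    """Assemble the argv for one calibration iteration."""
    return [
        sys.executable, script,
        "--workflow", workflow,
        "--prompt", prompt,
        "--run-tag", run_tag,
        "--analysis-dir", analysis_dir,
    ]


def run_iteration(command, runner=subprocess.run):
    """Run one iteration and parse the analysis JSON printed on stdout."""
    result = runner(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


command = build_bridge_command(
    "workflow/workflow_api.json",
    "portrait, warm light",
    "iter001",
    "/media/p5/Comfyui/output/calibrator",
)
# analysis = run_iteration(command)  # requires a running ComfyUI instance
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains commas, quotes or spaces.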

Status

Methodology + node selection (docs/METHODOLOGY.md)
Qwen3-VL Image Judge node (structured JSON scoring, swap-eval, model caching, file report)
Agent-driven architecture (docs/AGENT_LOOP.md) — Receptor node + agent_bridge.py
Example end-to-end workflow (workflow/workflow_api.json)
Agent calibration policy (docs/CALIBRATION_POLICY.md)
Optional: structured-config receptor (carry Prompt-Builder knobs instead of a flat string)
