Ethanfel 8b567cb531 chat mode: json_output toggle to return clean extracted JSON

For JSON-producing system prompts (e.g. LTX prompt-relay), json_output=true pulls
the JSON object out of the reply (strips reasoning/prose/code-fences via _parse_json,
which handles nested schemas and reasoning-then-JSON) and returns it re-serialized;
falls back to raw text if none parses. agent_bridge gains --json-output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-07-02 02:09:36 +02:00

ComfyUI-Prompt-Calibratror

A fully local prompt calibration loop for ComfyUI. A vision-language model (Qwen3-VL) judges how close a generated image is to a reference image and returns a structured score + per-axis difference analysis, which is used to calibrate the prompt-generation method (ComfyUI-Prompt-Builder) until the generated image matches the reference.

Full design rationale, controller options, and VLM-as-judge variance mitigations are in docs/METHODOLOGY.md. The controller is an external CLI agent that drives ComfyUI via its HTTP API — see docs/AGENT_LOOP.md.

Nodes & tools

| Component | What it is |
| --- | --- |
| `Qwen3-VL Image Judge (Calibrator)` | scores generated vs reference, writes analysis to disk for the agent |
| `SxCP External Prompt (Receptor)` | stable injection point; the agent sets `prompt/negative/seed` here per queue |
| `agent_bridge.py` | one CLI call = one iteration (inject → `POST /prompt` → wait → print analysis JSON) |
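
The inject → `POST /prompt` → wait flow of one iteration could be sketched roughly as follows. This is an illustration, not the code in `agent_bridge.py`: `POST /prompt` and `GET /history/<prompt_id>` are ComfyUI's standard HTTP endpoints, but the Receptor's `class_type` string (`SxCPExternalPrompt`), the helper names, and the default host are assumptions.

```python
import copy
import json
import time
import urllib.request

# Assumption: how the Receptor node is named in an API-format workflow export.
RECEPTOR_CLASS = "SxCPExternalPrompt"


def inject(workflow: dict, prompt: str, negative: str, seed: int) -> dict:
    """Return a copy of the workflow with the Receptor inputs overwritten."""
    wf = copy.deepcopy(workflow)
    hits = [n for n in wf.values() if n.get("class_type") == RECEPTOR_CLASS]
    if not hits:
        raise ValueError("no receptor node found in workflow")
    for node in hits:
        node["inputs"].update(prompt=prompt, negative=negative, seed=seed)
    return wf


def queue_and_wait(workflow: dict, host: str = "http://127.0.0.1:8188",
                   poll_s: float = 1.0, timeout_s: float = 600.0) -> dict:
    """POST the workflow to /prompt, then poll /history until the job appears."""
    body = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(f"{host}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        prompt_id = json.load(resp)["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{host}/history/{prompt_id}") as resp:
            history = json.load(resp)
        if prompt_id in history:  # the entry only shows up once execution ended
            return history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Copying the workflow before injecting keeps the loaded template untouched, so every iteration starts from the same baseline.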

The VLM node: `Qwen3-VL Image Judge (Calibrator)`

The core node (nodes/qwen_judge.py). It reuses the standard transformers Qwen3-VL inference plumbing (same approach as ComfyUI-QwenVL-MultiImage — the recommended reuse base) but forces strict JSON output so an automated loop can act on it.

Inputs

| name | type | default | notes |
| --- | --- | --- | --- |
| `reference_image` | IMAGE | - | the target |
| `mode` | compare / describe / chat | compare | `compare` = score ref vs generated. `describe` = first pass over the reference → caption + target spec. `chat` = general VLM: your `system_prompt` + `user_prompt` over the image(s) → raw text |
| `profile` | general / oral / penetration / handjob / solo | general | analysis profile with an act-specialized axis set; the act-critical axes are distance/proximity-aware (e.g. `mouth_genital_distance`) so magnitude isn't hidden behind a coarse label |
| `generated_image` | IMAGE (optional) | - | the candidate to score (required for `compare`, ignored for `describe`) |
| `model_select` | dropdown (model name) | 4B local | which judge (transformers/safetensors, auto-downloaded): Qwen3-VL 4B/8B/30B-A3B, Qwen3.5-9B, Qwen3.6-27B/35B-A3B (newer, natively multimodal). Param size shown in the label |
| `precision` | bf16 / fp8 / nf4 | bf16 | the quant; applies to the selected model (VRAM table below) |
| `model_path` | STRING | "" (empty) | manual override of the dropdown: local dir, HF repo id, or alias (`8b`/`30b-a3b`/`3.5-9b`/`3.6-27b`/`3.6-35b`). Empty = use `model_select` |
| `axes` | STRING input | - | (socket) optional override of the profile's axis set; wire a text node or leave unconnected to use `profile` |
| `max_new_tokens` | INT | 3072 | reasoning models (Qwen3.5/3.6) need room; raise it if the verdict gets cut off |
| `enable_thinking` | BOOL | true | let the model reason before judging. Keep on for accurate verdicts: off is faster, but makes reasoning models rubber-stamp `match` |
| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
| `keep_loaded` | BOOL | true | cache weights across loop iterations |
| `auto_download` | BOOL | true | if `model_path` is a repo id/alias and not local, fetch it from HF into `models/prompt_generator/` |
| `system_prompt` | STRING input | - | (socket) chat mode: wire your system prompt from a text node |
| `user_prompt` | STRING input | - | (socket) chat mode: wire your instruction from a text node |
| `reference_description` | STRING input | - | (socket) compare: wire describe's canonical output here to anchor the reference |

Auto-download: set model_path to 30b-a3b (alias) or any org/name repo id and leave auto_download on — the node snapshot-downloads it on first run (into ComfyUI's models/prompt_generator/<name>) and reuses the local copy afterward. Local paths and the default skip download entirely.
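
The resolution order described above (empty → dropdown default, local dir → use as-is, alias or repo id → download once, then reuse) might look like the sketch below. The alias table holds placeholder repo ids, not the node's real mapping, and `resolve_model`/`fetch` are hypothetical names; `snapshot_download` is the real `huggingface_hub` call.

```python
from pathlib import Path

# Hypothetical alias table: placeholder repo ids for illustration only.
ALIASES = {
    "8b": "example-org/qwen3-vl-8b-abliterated",
    "30b-a3b": "example-org/qwen3-vl-30b-a3b-abliterated",
}


def resolve_model(model_path: str, models_dir: Path, auto_download: bool = True):
    """Return ("default", None), ("local", Path) or ("download", (repo_id, target))."""
    if not model_path:
        return ("default", None)            # fall back to model_select
    local = Path(model_path)
    if local.is_dir():
        return ("local", local)             # explicit local directory
    repo_id = ALIASES.get(model_path.lower(), model_path)
    target = models_dir / repo_id.split("/")[-1]
    if target.is_dir():
        return ("local", target)            # already downloaded on an earlier run
    if not auto_download:
        raise FileNotFoundError(f"{repo_id} is not local and auto_download is off")
    return ("download", (repo_id, target))


def fetch(repo_id: str, target: Path) -> Path:
    """Snapshot-download a repo into `target` (imported lazily: needs huggingface_hub)."""
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Checking the target directory before downloading is what makes the second run reuse the local copy.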

General VLM (chat mode): set mode=chat and the node becomes a plain vision-language node. Feed it an image (and optionally a second), write your own system_prompt/user_prompt, and read the model's text from the analysis output. It reuses the same model dropdown, quant, and auto-download as the judge, so it works as a one-node abliterated VLM for captioning, tagging, Q&A, prompt-from-image, etc. (CLI: agent_bridge.py --mode chat --user-prompt "..."). Set json_output=true for JSON-producing system prompts: it extracts the JSON object from the reply (stripping any reasoning, prose, or code fences) and returns it clean and re-serialized, falling back to the raw text if nothing parses. This works even with enable_thinking on.
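
As an illustration of what `json_output` does, here is a minimal sketch of pulling a JSON object out of a mixed reply. It is not the node's `_parse_json`; the function name `extract_json` and the "last object wins" rule are assumptions made for this example.

```python
import json


def extract_json(reply: str) -> str:
    """Return the last parseable JSON object in `reply`, re-serialized.

    Falls back to the raw reply when no object parses.
    """
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = reply.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one complete value starting at `start`,
            # so nested objects are consumed as a whole.
            obj, end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            pos = start + 1      # not JSON here (e.g. a brace in prose); keep scanning
            continue
        found = obj              # keep the latest object: reasoning comes first
        pos = end
    return json.dumps(found, ensure_ascii=False) if found is not None else reply
```

Scanning for `{` rather than stripping fences explicitly means fenced, unfenced, and reasoning-prefixed replies all go through the same path.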

Performance / speed

This node runs models through transformers .generate() — the simplest path, but the slowest: no PagedAttention / continuous batching / fused kernels like vLLM, SGLang, or llama.cpp. With enable_thinking on, the model also emits thousands of reasoning tokens (each token = one forward pass) — that's the cost of accurate verdicts. Levers, fastest first:

- swap_eval = false: halves the work (one reasoned pass instead of two). Biggest free win.
- flash-attention: the node auto-uses flash_attention_2 if flash-attn is installed, else sdpa. Run pip install flash-attn for the speedup.
- smaller model / fewer axes: Qwen3.5-9B bf16 over the 27B/35B; trim axes or use a focused profile.
- enable_thinking = false: much faster, but reasoning models then rubber-stamp match; only for quick smoke tests.
- avoid nf4 for speed: bitsandbytes dequantizes every step; bf16/fp8 decode faster (nf4 is for fitting the big models, not speed).
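
Two of these levers are easy to show in code. The sketch below is an assumption about how such a choice could be made, not the node's own logic: `pick_attn_implementation` and `decode_budget` are hypothetical helpers, while `"flash_attention_2"` and `"sdpa"` are the values transformers accepts for `attn_implementation` in `from_pretrained`.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Prefer FlashAttention-2 when the flash_attn package is importable, else SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def decode_budget(swap_eval: bool, max_new_tokens: int = 3072) -> int:
    """Upper bound on decode steps per judgement (one forward pass per new token)."""
    passes = 2 if swap_eval else 1   # swap_eval judges twice with the images swapped
    return passes * max_new_tokens


# The returned string would be passed as attn_implementation=... when loading the model.
print(pick_attn_implementation(), decode_budget(swap_eval=True))
```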

The real fix for production speed is a different inference engine (vLLM/SGLang serve these models many× faster) — a heavier, separate-server setup not built into this node.

Outputs

| name | type | use |
| --- | --- | --- |
| `overall_score` | FLOAT 0..1 | compare: mean verdict (computed here, not by the model). describe: `1.0` placeholder |
| `axis_scores_json` | STRING (JSON) | compare: per-axis `{verdict, ref, gen}` (verdict = match/partial/mismatch). describe: `{axis: value}` |
| `analysis` | STRING | compare: header (`overall, N mismatches`) + axes worst-first (`VERDICT ref:[…] gen:[…]`). describe: the `caption`. chat: the model's response |
| `raw` | STRING | raw model output (both passes if `swap_eval`) |
| `report_path` | STRING | path to the written `calib_<tag>.json` (carries `mismatch_count`) |
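
To make the relation between the per-axis verdicts and `overall_score` concrete, here is one plausible way to compute it. The numeric weights (match = 1, partial = 0.5, mismatch = 0), the way two swapped passes are averaged, and the exact line layout are assumptions; only the verdict names, the mean, and the worst-first ordering come from the table above.

```python
# Assumed weights: the README only says the score is the mean verdict.
VERDICT_SCORE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def overall_score(axes: dict) -> float:
    """Mean verdict over all axes, in 0..1."""
    if not axes:
        return 0.0
    return sum(VERDICT_SCORE[a["verdict"]] for a in axes.values()) / len(axes)


def swap_averaged(pass_ab: dict, pass_ba: dict) -> float:
    """Average the scores of the two swap_eval passes."""
    return (overall_score(pass_ab) + overall_score(pass_ba)) / 2


def analysis_text(axes: dict) -> str:
    """Header line followed by the axes, worst verdict first."""
    ranked = sorted(axes.items(), key=lambda kv: VERDICT_SCORE[kv[1]["verdict"]])
    mismatches = sum(1 for a in axes.values() if a["verdict"] == "mismatch")
    header = f"overall {overall_score(axes):.2f}, {mismatches} mismatches"
    lines = [f"{name}: {a['verdict'].upper()} ref:[{a['ref']}] gen:[{a['gen']}]"
             for name, a in ranked]
    return "\n".join([header, *lines])
```

Computing the score outside the model keeps it deterministic for a given set of verdicts, which is what a stop condition needs.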

Install

cd /media/p5/Comfyui/custom_nodes
ln -s /media/p5/ComfyUI-Prompt-Calibratror .     # or git clone
/media/p5/Comfyui/venv/bin/pip install -r /media/p5/ComfyUI-Prompt-Calibratror/requirements.txt

The node defaults to the huihui-ai Qwen3-VL-4B-Instruct abliterated weights already converted at /media/p5/qwen3vl_4b_abliterated_comfy_convert/ so it runs out of the box (the abliterated/uncensored variant won't refuse to analyze adult imagery, which would otherwise break the loop).

Pick a model in model_select and a quant in precision. All are abliterated, multimodal safetensors (transformers), auto-downloaded. The newer Qwen3.5/3.6 are natively multimodal (need a recent transformers — they load via AutoModelForMultimodalLM).

VRAM by quant on the RTX 5090 32 GB (✅ fits / ⚠ tight / ❌):

| model | bf16 (GB) | fp8 (GB) | nf4 (GB) | note |
| --- | --- | --- | --- | --- |
| Qwen3-VL-4B (local) | ✅ ~9 | ✅ ~5 | ✅ ~3 | fast, weak |
| Qwen3-VL-8B | ✅ ~17 | ✅ ~9 | ✅ ~6 | solid, fast |
| Qwen3.5-9B | ✅ ~20 | ✅ ~10 | ✅ ~7 | newer, fast; recommended |
| Qwen3-VL-30B-A3B (MoE) | ❌ ~62 | ⚠ ~31 | ✅ ~18 | nf4 slow |
| Qwen3.6-27B (dense) | ❌ ~56 | ⚠ ~28 | ✅ ~16 | nf4 slow, strong |
| Qwen3.6-35B-A3B (MoE) | ❌ ~70 | ❌ | ✅ ~20 | nf4 slow, top quality |

nf4 (bitsandbytes) fits the big ones but is slow (dequant overhead) — that's the bottleneck, not the model. fp8 is fast but only when a real fp8 checkpoint exists (the local 4B has one; precision=fp8 on a bf16-only repo won't quantize). For speed + recency, Qwen3.5-9B at bf16 is the sweet spot. See docs/METHODOLOGY.md.
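
The table's figures roughly follow parameter count times bytes per parameter, plus some overhead. The back-of-the-envelope sketch below uses assumed values for the overhead (1 GB) and the "tight" threshold (85% of VRAM); it reproduces the table's ✅/⚠/❌ pattern for a few rows but is not how the table was measured.

```python
# Bytes per parameter for each quant (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB: billions of params x bytes per param."""
    return params_billion * BYTES_PER_PARAM[precision]


def classify(params_billion: float, precision: str,
             vram_gb: float = 32.0, overhead_gb: float = 1.0) -> str:
    """Return "fits", "tight" or "no" for a given card (thresholds are assumptions)."""
    need = weight_gb(params_billion, precision) + overhead_gb
    if need <= 0.85 * vram_gb:
        return "fits"
    return "tight" if need <= vram_gb else "no"


for name, size in [("Qwen3.5-9B", 9), ("Qwen3.6-27B", 27), ("Qwen3.6-35B-A3B", 35)]:
    print(name, {p: classify(size, p) for p in BYTES_PER_PARAM})
```

Real usage is higher than the weight size alone (vision tower, KV cache for long reasoning traces), which is why the table's numbers sit a few GB above this estimate.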

Loop sketch

Prompt-Builder (SxCP) ──prompt──▶ T2I (SDXL/Flux/Krea2) ──image──▶ Qwen3-VL Image Judge
        ▲                                                                │
        └──────── knob overrides ◀── Controller ◀── overall_score + diff ┘

Use the Prompt-Builder For-Loop Start/End + Accumulator nodes to drive iterations and route overall_score into the stop condition. Controller options (greedy hill-climb → black-box optimizer → LLM-in-the-loop) are in the methodology doc.
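
The simplest controller option, a greedy hill-climb, can be sketched as below. This is a generic coordinate-wise version with a stubbed scoring function, not the controller described in the methodology doc; `hill_climb`, the knob names, and the 0.9 target are assumptions. In a real loop, `score` would queue one generation and return the judge's `overall_score`, so every call costs a full render plus a judgement.

```python
from typing import Callable


def hill_climb(knobs: dict, choices: dict, score: Callable[[dict], float],
               target: float = 0.9, max_rounds: int = 5):
    """Try one knob value at a time and keep a change only if the score improves."""
    best, best_score = dict(knobs), score(knobs)
    for _ in range(max_rounds):
        improved = False
        for name, options in choices.items():
            for option in options:
                if best_score >= target:          # stop condition reached
                    return best, best_score
                candidate = {**best, name: option}
                s = score(candidate)
                if s > best_score:
                    best, best_score, improved = candidate, s, True
        if not improved:                          # local optimum: nothing helped
            break
    return best, best_score


# Stub judge: fraction of knobs that agree with a hidden "reference" setting.
reference = {"lighting": "soft", "angle": "low"}
stub = lambda k: sum(k[n] == v for n, v in reference.items()) / len(reference)

start = {"lighting": "hard", "angle": "high"}
options = {"lighting": ["hard", "soft", "neon"], "angle": ["high", "low"]}
print(hill_climb(start, options, stub))
```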

End-to-end loop

1. Run ComfyUI with --listen, install this node pack, put your reference at ComfyUI/input/reference.png.
2. First pass (describe): the judge looks at the reference alone and emits one canonical scene description (coherent paragraph + per-axis target spec) to seed the prompt and anchor the loop:
   ```
   python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
     --run-tag seed --analysis-dir /media/p5/Comfyui/output/calibrator
   ```
3. Compare loop: load workflow/workflow_api.json (SDXL waiIllustriousSDXL_v160 example; swap the checkpoint for Flux/Krea as needed) and iterate, following docs/CALIBRATION_POLICY.md. Pass --ref-desc-file so compare anchors on the canonical reference (the ref side stays fixed; only the generated image is re-read each turn):
   ```
   python agent_bridge.py --workflow workflow/workflow_api.json \
     --prompt "<description from step 2, then calibrated>" \
     --ref-desc-file /media/p5/Comfyui/output/calibrator/calib_seed.json \
     --run-tag iter001 --analysis-dir /media/p5/Comfyui/output/calibrator
   ```
   stdout = the analysis JSON ({verdict, ref, gen} per axis) → agent steers toward ref → next iteration.
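
An agent written in Python could wrap the compare call roughly as follows. The flags are the ones shown in the commands above; the helper names and the assumption that stdout is a single JSON object mapping each axis name to `{verdict, ref, gen}` are illustrative and may not match the bridge's exact output shape.

```python
import json
import subprocess
import sys


def bridge_command(workflow: str, prompt: str, ref_desc_file: str,
                   iteration: int, analysis_dir: str) -> list[str]:
    """Build the argv for one compare iteration (run tag like iter001)."""
    return [sys.executable, "agent_bridge.py",
            "--workflow", workflow,
            "--prompt", prompt,
            "--ref-desc-file", ref_desc_file,
            "--run-tag", f"iter{iteration:03d}",
            "--analysis-dir", analysis_dir]


def axes_to_fix(analysis: dict) -> list[str]:
    """Names of axes whose verdict is not `match`, mismatches before partials."""
    order = {"mismatch": 0, "partial": 1}
    bad = [(order[a["verdict"]], name) for name, a in analysis.items()
           if a.get("verdict") in order]
    return [name for _, name in sorted(bad)]


def run_iteration(**kwargs) -> list[str]:
    """Run the bridge once and return the axes the next prompt should address."""
    done = subprocess.run(bridge_command(**kwargs), check=True,
                          capture_output=True, text=True)
    return axes_to_fix(json.loads(done.stdout))
```

Passing the arguments as a list avoids shell quoting problems with long prompts.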

Status

- Methodology + node selection (docs/METHODOLOGY.md)
- Qwen3-VL Image Judge node: describe (first pass) + compare (scoring), swap-eval, file report
- Agent-driven architecture (docs/AGENT_LOOP.md): Receptor node + agent_bridge.py (--mode)
- Example workflows: workflow_describe_api.json (first pass) + workflow_api.json (compare loop)
- Agent calibration policy (docs/CALIBRATION_POLICY.md)
- Optional: structured-config receptor (carry Prompt-Builder knobs instead of a flat string)
