# ComfyUI-Prompt-Calibratror

A **fully local** prompt calibration loop for ComfyUI. A vision-language model
(Qwen3-VL) judges how close a *generated* image is to a *reference* image and
returns a structured score + per-axis difference analysis, which is used to
**calibrate the prompt-generation method** ([ComfyUI-Prompt-Builder](../ComfyUI-Prompt-Builder))
until the generated image matches the reference.

> Full design rationale, controller options, and VLM-as-judge variance mitigations
> are in **[docs/METHODOLOGY.md](docs/METHODOLOGY.md)**. The controller is an **external
> CLI agent** that drives ComfyUI via its HTTP API; see **[docs/AGENT_LOOP.md](docs/AGENT_LOOP.md)**.

## Nodes & tools

| Component | What it is |
|---|---|
| `Qwen3-VL Image Judge (Calibrator)` | scores generated vs reference, writes analysis to disk for the agent |
| `SxCP External Prompt (Receptor)` | stable injection point; the agent sets `prompt/negative/seed` here per queue |
| `agent_bridge.py` | one CLI call = one iteration (inject → `POST /prompt` → wait → print analysis JSON) |

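
The "inject → `POST /prompt` → wait" iteration can be sketched roughly as follows. This is an illustration and not the shipped `agent_bridge.py`: it assumes ComfyUI's stock `POST /prompt` and `GET /history/<prompt_id>` endpoints on the default local address, and the helper names and the receptor node id passed to `inject` are hypothetical.

```python
import copy
import json
import time
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # assumption: default local ComfyUI address


def inject(workflow: dict, node_id: str, prompt: str, negative: str, seed: int) -> dict:
    """Return a copy of an API-format workflow with the receptor inputs overwritten."""
    patched = copy.deepcopy(workflow)
    patched[node_id]["inputs"].update(prompt=prompt, negative=negative, seed=seed)
    return patched


def queue_and_wait(workflow: dict, poll_s: float = 1.0, timeout_s: float = 600.0) -> dict:
    """POST the workflow to /prompt, then poll /history until the run shows up."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        prompt_id = json.load(resp)["prompt_id"]

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{COMFY_URL}/history/{prompt_id}") as resp:
            history = json.load(resp)
        if prompt_id in history:  # the entry appears once execution has finished
            return history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Copying the workflow before patching keeps the loaded template untouched, so every iteration starts from the same baseline graph.
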
## The "vllm node": `Qwen3-VL Image Judge (Calibrator)`

The core node (`nodes/qwen_judge.py`). It reuses the standard transformers Qwen3-VL
inference plumbing (same approach as
[ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage),
the recommended reuse base) but **forces strict JSON output** so an automated loop
can act on it.

**Inputs**

| name | type | default | notes |
|---|---|---|---|
| `reference_image` | IMAGE | n/a | the target |
| `mode` | compare / describe / chat | compare | `compare` = score ref vs generated. `describe` = first pass over the reference → caption + target spec. `chat` = **general VLM**: your `system_prompt` + `user_prompt` over the image(s) → raw text |
| `profile` | general / oral / penetration / handjob / solo | general | **analysis profile**: act-specialized axis set; the act-critical axes are distance/proximity-aware (e.g. `mouth_genital_distance`) so magnitude isn't hidden behind a coarse label |
| `generated_image` | IMAGE (optional) | n/a | the candidate to score (required for `compare`, ignored for `describe`) |
| `model_select` | dropdown (model name) | 4B local | **which judge** (transformers/safetensors, auto-downloaded): Qwen3-VL 4B/8B/30B-A3B, **Qwen3.5-9B**, **Qwen3.6-27B/35B-A3B** (newer, natively multimodal). Param size shown in the label |
| `precision` | bf16 / fp8 / nf4 | bf16 | **the quant**: applies to the selected model (VRAM table below) |
| `model_path` | STRING | "" (empty) | **manual override** of the dropdown: local dir, HF repo id, or alias (`8b`/`30b-a3b`/`3.5-9b`/`3.6-27b`/`3.6-35b`). Empty = use `model_select` |
| `axes` | STRING **input** | n/a | (socket) optional override of the profile's axis set; wire a text node or leave unconnected to use `profile` |
| `max_new_tokens` | INT | 3072 | reasoning models (Qwen3.5/3.6) need room; raise it if the verdict gets cut off |
| `enable_thinking` | BOOL | true | let the model reason before judging. **Keep on for accurate verdicts**: with it off, reasoning models rubber-stamp `match`. Off is faster |
| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
| `keep_loaded` | BOOL | true | cache weights across loop iterations |
| `auto_download` | BOOL | true | if `model_path` is a repo id/alias and not local, fetch it from HF into `models/prompt_generator/` |
| `system_prompt` | STRING **input** | n/a | (socket) chat mode: wire your system prompt from a text node |
| `user_prompt` | STRING **input** | n/a | (socket) chat mode: wire your instruction from a text node |
| `reference_description` | STRING **input** | n/a | (socket) compare: wire describe's canonical output here to anchor the reference |

**Auto-download:** set `model_path` to `30b-a3b` (alias) or any `org/name` repo id and leave
`auto_download` on; the node snapshot-downloads it on first run (into ComfyUI's
`models/prompt_generator/<name>`) and reuses the local copy afterward. Local paths and the
default skip download entirely.

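
The download-once behaviour described above could look roughly like this. It is a sketch under assumptions, not the node's code: the function names are hypothetical, it uses `huggingface_hub.snapshot_download`, and it expects a repo id of the form `org/name` (mapping aliases such as `30b-a3b` to repo ids is left out because that table is not shown here).

```python
from pathlib import Path


def local_model_dir(models_root: str, model_path: str) -> Path:
    """Map a repo id like 'org/name' to <models_root>/prompt_generator/<name>."""
    name = model_path.rstrip("/").split("/")[-1]
    return Path(models_root) / "prompt_generator" / name


def ensure_model(models_root: str, model_path: str, auto_download: bool = True) -> Path:
    """Return a usable local directory, downloading a snapshot only when needed."""
    as_local = Path(model_path)
    if as_local.is_dir():  # an existing local path skips download entirely
        return as_local
    target = local_model_dir(models_root, model_path)
    if target.is_dir() and any(target.iterdir()):  # already fetched on an earlier run
        return target
    if not auto_download:
        raise FileNotFoundError(f"{model_path} is not local and auto_download is off")
    # Imported lazily so the path logic works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=model_path, local_dir=str(target))
    return target
```
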
**General VLM (chat mode):** set `mode=chat` and the node becomes a plain vision-language
node: feed an image (and optionally a second), write your own `system_prompt`/`user_prompt`,
and read the model's text from the `analysis` output. It reuses the same model dropdown, quant,
and auto-download as the judge, so it's a one-node abliterated VLM for captioning, tagging,
Q&A, prompt-from-image, etc. (CLI: `agent_bridge.py --mode chat --user-prompt "..."`).
Set **`json_output=true`** for JSON-producing system prompts: it extracts the JSON object
from the reply (stripping any reasoning, prose, or Markdown code fences) and returns it clean and
re-serialized (falls back to raw text if none parses). Works even with `enable_thinking` on.

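
To make the `json_output` behaviour concrete, here is a minimal sketch of one way to pull a JSON object out of a reply that mixes reasoning, prose, and fences. It is not the node's own parser: the function name is hypothetical, and picking the *last* parseable object (on the theory that reasoning comes before the answer) is an assumption.

```python
import json


def extract_json_object(reply: str) -> str:
    """Return the last parseable JSON object in `reply`, re-serialized; else the raw text."""
    decoder = json.JSONDecoder()
    found = None
    pos = reply.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(reply, pos)
        except json.JSONDecodeError:
            pos = reply.find("{", pos + 1)  # not valid JSON here; try the next brace
            continue
        if isinstance(obj, dict):
            found = obj  # keep scanning: earlier text may hold JSON-like snippets
        pos = reply.find("{", end)  # skip past the parsed object, nested braces included
    if found is None:
        return reply  # fallback: nothing parsed, hand back the raw text
    return json.dumps(found, ensure_ascii=False)
```

Because `raw_decode` parses a complete value starting at a given index, nested schemas are handled without any brace counting, and surrounding fences or prose are simply never visited.
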
## Performance / speed

This node runs models through **transformers `.generate()`**, the simplest path but the
**slowest**: no PagedAttention / continuous batching / fused kernels like vLLM, SGLang, or
llama.cpp. With `enable_thinking` on, the model also emits thousands of reasoning tokens
(each token = one forward pass); that's the cost of accurate verdicts. Levers, fastest first:

- **`swap_eval = false`**: halves the work (one reasoned pass instead of two). Biggest free win.
- **flash-attention**: the node auto-uses `flash_attention_2` if `flash-attn` is installed, else `sdpa`. `pip install flash-attn` for the speedup.
- **smaller model / fewer axes**: Qwen3.5-9B bf16 over the 27B/35B; trim `axes` or use a focused `profile`.
- **`enable_thinking = false`**: much faster, but reasoning models then rubber-stamp `match`; only for quick smoke tests.
- **avoid `nf4`** for speed: bitsandbytes dequantizes every step; `bf16`/`fp8` decode faster (nf4 is for *fitting* the big models, not speed).

The real fix for production speed is a different inference engine (vLLM/SGLang serve these
models many× faster), a heavier, separate-server setup not built into this node.

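
The flash-attention lever in the list above comes down to a check like the one below. This is a sketch with a hypothetical function name; it assumes the `flash-attn` package imports as `flash_attn` and that the returned string is passed as `attn_implementation=` to a transformers `from_pretrained()` call.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Prefer FlashAttention-2 when the flash_attn package is importable, else SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Prints whichever backend would be selected in the current environment.
print(pick_attn_implementation())
```
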
**Outputs**

| name | type | use |
|---|---|---|
| `overall_score` | FLOAT 0..1 | compare: mean verdict (computed here, not by the model). describe: `1.0` placeholder |
| `axis_scores_json` | STRING (JSON) | compare: per-axis `{verdict, ref, gen}` (verdict = match/partial/mismatch). describe: `{axis: value}` |
| `analysis` | STRING | compare: header (`overall, N mismatches`) + axes worst-first (`VERDICT ref:[…] gen:[…]`). describe: the `caption`. chat: the model's response |
| `raw` | STRING | raw model output (both passes if `swap_eval`) |
| `report_path` | STRING | path to the written `calib_<tag>.json` (carries `mismatch_count`) |

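
As an illustration of how `overall_score` can be derived from the per-axis verdicts rather than asked of the model, consider the sketch below. The numeric weights (match = 1, partial = 0.5, mismatch = 0), the function names, and averaging the two `swap_eval` passes at the score level are all assumptions; the node's exact formula may differ.

```python
from statistics import mean

# Assumed verdict weights; only "mean verdict" is documented, not the mapping.
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def overall_score(axis_scores: dict) -> float:
    """Mean verdict value across axes, in 0..1."""
    return mean(VERDICT_VALUE[a["verdict"]] for a in axis_scores.values())


def worst_first(axis_scores: dict) -> list:
    """Axis names ordered mismatch, then partial, then match."""
    return sorted(axis_scores, key=lambda name: VERDICT_VALUE[axis_scores[name]["verdict"]])


def swap_averaged(pass_a: dict, pass_b: dict) -> float:
    """Average the scores of the two swap_eval passes."""
    return (overall_score(pass_a) + overall_score(pass_b)) / 2


axes = {
    "pose": {"verdict": "match", "ref": "standing", "gen": "standing"},
    "lighting": {"verdict": "partial", "ref": "warm", "gen": "neutral"},
    "background": {"verdict": "mismatch", "ref": "beach", "gen": "forest"},
}
print(overall_score(axes), worst_first(axes))
```

Computing the score in code keeps it a deterministic function of the verdicts, so the same verdicts always yield the same number.
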
## Install

```bash
cd /media/p5/Comfyui/custom_nodes
ln -s /media/p5/ComfyUI-Prompt-Calibratror . # or git clone
/media/p5/Comfyui/venv/bin/pip install -r /media/p5/ComfyUI-Prompt-Calibratror/requirements.txt
```

The node defaults to the **huihui-ai Qwen3-VL-4B-Instruct abliterated** weights already
converted at `/media/p5/qwen3vl_4b_abliterated_comfy_convert/` so it runs out of the box
(the abliterated/uncensored variant won't refuse to analyze adult imagery, which would
otherwise break the loop).

**Pick a model in `model_select` and a quant in `precision`.** All are abliterated,
multimodal **safetensors** (transformers), auto-downloaded. The newer **Qwen3.5/3.6** are
natively multimodal (they need a recent transformers and load via `AutoModelForMultimodalLM`).

Approximate VRAM in GB by quant on the RTX 5090 32 GB (✅ fits / ⚠ tight / ❌ does not fit):

| model | bf16 | fp8 | nf4 | note |
|---|---|---|---|---|
| Qwen3-VL-4B (local) | ✅ ~9 | ✅ ~5 | ✅ ~3 | fast, weak |
| Qwen3-VL-8B | ✅ ~17 | ✅ ~9 | ✅ ~6 | solid, fast |
| **Qwen3.5-9B** | ✅ ~20 | ✅ ~10 | ✅ ~7 | **newer, fast, recommended** |
| Qwen3-VL-30B-A3B (MoE) | ❌ ~62 | ⚠ ~31 | ✅ ~18 | nf4 slow |
| Qwen3.6-27B (dense) | ❌ ~56 | ⚠ ~28 | ✅ ~16 | nf4 slow, strong |
| Qwen3.6-35B-A3B (MoE) | ❌ ~70 | ❌ | ✅ ~20 | nf4 slow, top quality |

`nf4` (bitsandbytes) fits the big ones but is **slow** (dequant overhead); that's the
bottleneck, not the model. `fp8` is fast but only when a real fp8 checkpoint exists (the
local 4B has one; `precision=fp8` on a bf16-only repo won't quantize). For speed + recency,
**Qwen3.5-9B at bf16** is the sweet spot. See
[docs/METHODOLOGY.md](docs/METHODOLOGY.md#model-sizing-on-32-gb-rtx-5090--abliterated-latest-qwen-vl).

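
The shape of the VRAM table follows from simple arithmetic on bytes per parameter, sketched below. This is a back-of-envelope estimate of weight memory only, with a hypothetical function name and nominal parameter counts taken from the model names; the table's figures sit somewhat higher, presumably because of the vision tower, quantization metadata, and runtime buffers.

```python
# Bytes per parameter for each precision (nf4 packs two 4-bit weights per byte).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, size in [("Qwen3-VL-8B", 8), ("Qwen3.6-27B", 27), ("Qwen3.6-35B-A3B", 35)]:
    row = {p: weight_gb(size, p) for p in BYTES_PER_PARAM}
    print(name, row)
```

Note that for the MoE entries the full parameter count is what must fit in memory; the "A3B" active-parameter figure affects compute per token, not the weight footprint.
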
## Loop sketch

```
Prompt-Builder (SxCP) ──prompt──▶ T2I (SDXL/Flux/Krea2) ──image──▶ Qwen3-VL Image Judge
▲                                                                │
└──────── knob overrides ◀── Controller ◀── overall_score + diff ┘
```

Use the Prompt-Builder **For-Loop Start/End + Accumulator** nodes to drive iterations and
route `overall_score` into the stop condition. Controller options (greedy hill-climb →
black-box optimizer → LLM-in-the-loop) are in the methodology doc.

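
The simplest controller named above, a greedy hill-climb, can be sketched as follows. Everything here is illustrative: the knob dictionary, step size, and stop threshold are hypothetical, and `score_fn` stands in for "render with these knobs, run the judge, return `overall_score`".

```python
import random
from typing import Callable, Dict


def hill_climb(
    knobs: Dict[str, float],
    score_fn: Callable[[Dict[str, float]], float],
    step: float = 0.1,
    max_iters: int = 50,
    target: float = 0.95,
    seed: int = 0,
) -> Dict[str, float]:
    """Perturb one knob at a time; keep a change only if the score improves."""
    rng = random.Random(seed)
    best = dict(knobs)
    best_score = score_fn(best)
    for _ in range(max_iters):
        if best_score >= target:  # stop condition fed by overall_score
            break
        name = rng.choice(sorted(best))
        candidate = dict(best)
        candidate[name] += rng.choice((-step, step))
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:  # greedy: never accept a worse state
            best, best_score = candidate, candidate_score
    return best
```

Each `score_fn` call costs a full generate-and-judge round trip in the real loop, which is why the methodology doc also lists more sample-efficient controllers.
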
## End-to-end loop

1. Run ComfyUI with `--listen`, install this node pack, put your reference at `ComfyUI/input/reference.png`.
2. **First pass (describe):** the judge looks at the reference alone and emits **one canonical
   scene description** (coherent paragraph + per-axis target spec) to seed the prompt *and*
   anchor the loop:

   ```bash
   python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
     --run-tag seed --analysis-dir /media/p5/Comfyui/output/calibrator
   ```
3. **Compare loop:** load `workflow/workflow_api.json` (SDXL `waiIllustriousSDXL_v160` example;
   swap the checkpoint for Flux/Krea as needed) and iterate, following `docs/CALIBRATION_POLICY.md`.
   Pass `--ref-desc-file` so compare anchors on the canonical reference (the `ref` side stays
   fixed; only the generated image is re-read each turn):

   ```bash
   python agent_bridge.py --workflow workflow/workflow_api.json \
     --prompt "<description from step 2, then calibrated>" \
     --ref-desc-file /media/p5/Comfyui/output/calibrator/calib_seed.json \
     --run-tag iter001 --analysis-dir /media/p5/Comfyui/output/calibrator
   ```

   stdout = the analysis JSON (`{verdict, ref, gen}` per axis) → agent steers toward `ref` → next iteration.

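
An external agent could drive step 3 with a small wrapper like the one below. It is a sketch under assumptions: the flags and paths are the ones shown in the commands above, but the helper names are hypothetical, and it assumes stdout is exactly one JSON document keyed by axis name.

```python
import json
import subprocess
from typing import List

ANALYSIS_DIR = "/media/p5/Comfyui/output/calibrator"  # path taken from the commands above


def build_compare_cmd(iteration: int, prompt: str) -> List[str]:
    """Assemble the compare-step command for one iteration."""
    return [
        "python", "agent_bridge.py",
        "--workflow", "workflow/workflow_api.json",
        "--prompt", prompt,
        "--ref-desc-file", f"{ANALYSIS_DIR}/calib_seed.json",
        "--run-tag", f"iter{iteration:03d}",
        "--analysis-dir", ANALYSIS_DIR,
    ]


def run_iteration(iteration: int, prompt: str) -> dict:
    """Run one iteration and parse the analysis JSON printed to stdout."""
    done = subprocess.run(
        build_compare_cmd(iteration, prompt), check=True, capture_output=True, text=True
    )
    return json.loads(done.stdout)


def axes_to_fix(analysis: dict) -> List[str]:
    """Names of axes judged partial or mismatch (assumes {axis: {verdict, ref, gen}})."""
    return [
        name for name, entry in analysis.items()
        if isinstance(entry, dict) and entry.get("verdict") in ("partial", "mismatch")
    ]
```

The agent would then rewrite the prompt so each axis returned by `axes_to_fix` moves toward its `ref` value, and call `run_iteration` again with the next iteration number.
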
## Status

- [x] Methodology + node selection (`docs/METHODOLOGY.md`)
- [x] Qwen3-VL Image Judge node: `describe` (first pass) + `compare` (scoring), swap-eval, file report
- [x] Agent-driven architecture (`docs/AGENT_LOOP.md`): Receptor node + `agent_bridge.py` (`--mode`)
- [x] Example workflows: `workflow_describe_api.json` (first pass) + `workflow_api.json` (compare loop)
- [x] Agent calibration policy (`docs/CALIBRATION_POLICY.md`)
- [ ] Optional: structured-config receptor (carry Prompt-Builder knobs instead of a flat string)