Ethanfel 8b567cb531 chat mode: json_output toggle to return clean extracted JSON
For JSON-producing system prompts (e.g. LTX prompt-relay), json_output=true pulls
the JSON object out of the reply (strips reasoning/prose/code-fences via _parse_json,
which handles nested schemas and reasoning-then-JSON) and returns it re-serialized;
falls back to raw text if none parses. agent_bridge gains --json-output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 02:09:36 +02:00

# ComfyUI-Prompt-Calibrator
A **fully local** prompt calibration loop for ComfyUI. A vision-language model
(Qwen3-VL) judges how close a *generated* image is to a *reference* image and
returns a structured score + per-axis difference analysis, which is used to
**calibrate the prompt-generation method** ([ComfyUI-Prompt-Builder](../ComfyUI-Prompt-Builder))
until the generated image matches the reference.
> Full design rationale, controller options, and VLM-as-judge variance mitigations
> are in **[docs/METHODOLOGY.md](docs/METHODOLOGY.md)**. The controller is an **external
> CLI agent** that drives ComfyUI via its HTTP API — see **[docs/AGENT_LOOP.md](docs/AGENT_LOOP.md)**.
## Nodes & tools
| Component | What it is |
|---|---|
| `Qwen3-VL Image Judge (Calibrator)` | scores generated vs reference, writes analysis to disk for the agent |
| `SxCP External Prompt (Receptor)` | stable injection point; the agent sets `prompt/negative/seed` here per queue |
| `agent_bridge.py` | one CLI call = one iteration (inject → `POST /prompt` → wait → print analysis JSON) |
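The inject → `POST /prompt` → wait flow in the last row can be sketched roughly as below. This is an illustration, not the actual `agent_bridge.py`: the `class_type` string `SxCPExternalPrompt`, the helper names, and the server address are assumptions. `/prompt` and `/history/<prompt_id>` are ComfyUI's standard HTTP endpoints.

```python
import copy
import json
import time
import urllib.request

# Hypothetical class_type of the Receptor node in an API-format workflow.
RECEPTOR_CLASS = "SxCPExternalPrompt"


def inject(workflow, prompt, negative="", seed=0, receptor_class=RECEPTOR_CLASS):
    """Return a copy of the workflow with the Receptor inputs overwritten."""
    wf = copy.deepcopy(workflow)
    hits = [n for n in wf.values() if n.get("class_type") == receptor_class]
    if not hits:
        raise ValueError(f"no {receptor_class} node in workflow")
    for node in hits:
        node["inputs"].update(prompt=prompt, negative=negative, seed=seed)
    return wf


def queue_and_wait(workflow, server="http://127.0.0.1:8188", poll_s=1.0, timeout_s=600):
    """POST the workflow to /prompt, then poll /history/<id> until it appears."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{server}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        prompt_id = json.load(resp)["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{server}/history/{prompt_id}") as resp:
            history = json.load(resp)
        if prompt_id in history:          # entry shows up once execution is done
            return history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish in {timeout_s}s")
```

`inject` works on a copy, so the workflow template loaded from disk can be reused unchanged for every iteration.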
## The "VLM node": `Qwen3-VL Image Judge (Calibrator)`
The core node (`nodes/qwen_judge.py`). It reuses the standard transformers Qwen3-VL
inference plumbing (same approach as
[ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)
— the recommended reuse base) but **forces strict JSON output** so an automated loop
can act on it.
**Inputs**
| name | type | default | notes |
|---|---|---|---|
| `reference_image` | IMAGE | — | the target |
| `mode` | compare / describe / chat | compare | `compare` = score ref vs generated. `describe` = first pass over the reference → caption + target spec. `chat` = **general VLM**: your `system_prompt` + `user_prompt` over the image(s) → raw text |
| `profile` | general / oral / penetration / handjob / solo | general | **analysis profile** — act-specialized axis set; the act-critical axes are distance/proximity-aware (e.g. `mouth_genital_distance`) so magnitude isn't hidden behind a coarse label |
| `generated_image` | IMAGE (optional) | — | the candidate to score (required for `compare`, ignored for `describe`) |
| `model_select` | dropdown (model name) | 4B local | **which judge** (transformers/safetensors, auto-downloaded): Qwen3-VL 4B/8B/30B-A3B, **Qwen3.5-9B**, **Qwen3.6-27B/35B-A3B** (newer, natively multimodal). Param size shown in the label |
| `precision` | bf16 / fp8 / nf4 | bf16 | **the quant** — applies to the selected model (VRAM table below) |
| `model_path` | STRING | "" (empty) | **manual override** of the dropdown — local dir, HF repo id, or alias (`8b`/`30b-a3b`/`3.5-9b`/`3.6-27b`/`3.6-35b`). Empty = use `model_select` |
| `axes` | STRING **input** | — | (socket) optional override of the profile's axis set; wire a text node or leave unconnected to use `profile` |
| `max_new_tokens` | INT | 3072 | reasoning models (Qwen3.5/3.6) need room; raise it if the verdict gets cut off |
| `enable_thinking` | BOOL | true | let the model reason before judging. **Keep on for accurate verdicts** — off makes reasoning models rubber-stamp `match`. Off is faster |
| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
| `keep_loaded` | BOOL | true | cache weights across loop iterations |
| `auto_download` | BOOL | true | if `model_path` is a repo id/alias and not local, fetch it from HF into `models/prompt_generator/` |
| `system_prompt` | STRING **input** | — | (socket) chat mode: wire your system prompt from a text node |
| `user_prompt` | STRING **input** | — | (socket) chat mode: wire your instruction from a text node |
| `reference_description` | STRING **input** | — | (socket) compare: wire describe's canonical output here to anchor the reference |
**Auto-download:** set `model_path` to `30b-a3b` (alias) or any `org/name` repo id and leave
`auto_download` on — the node snapshot-downloads it on first run (into ComfyUI's
`models/prompt_generator/<name>`) and reuses the local copy afterward. Local paths and the
default skip download entirely.
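A minimal sketch of that resolution order (local path, then cached copy, then download), assuming `huggingface_hub.snapshot_download` as the fetch call. The alias table, the function name, and the target directory naming are placeholders for illustration; the node's real alias-to-repo mapping is not reproduced here.

```python
from pathlib import Path

# Hypothetical alias table; the node's real alias -> repo mapping is not shown here.
ALIASES = {"example-alias": "example-org/example-model"}


def resolve_model(model_path, models_dir, auto_download=True, aliases=ALIASES):
    """Return a local directory for model_path, downloading only when needed."""
    if not model_path:
        raise ValueError("empty model_path: fall back to model_select instead")
    if Path(model_path).is_dir():                    # local path: never download
        return Path(model_path)
    repo_id = aliases.get(model_path.lower(), model_path)
    target = Path(models_dir) / repo_id.split("/")[-1]
    if target.is_dir() and any(target.iterdir()):    # fetched on an earlier run
        return target
    if not auto_download:
        raise FileNotFoundError(f"{repo_id} is not local and auto_download is off")
    from huggingface_hub import snapshot_download    # imported lazily, only when fetching
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Importing `huggingface_hub` inside the function keeps purely local setups free of the dependency at import time.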
**General VLM (chat mode):** set `mode=chat` and the node becomes a plain vision-language
node — feed an image (and optionally a second), write your own `system_prompt`/`user_prompt`,
and read the model's text from the `analysis` output. Reuses the same model dropdown, quant,
and auto-download as the judge, so it's a one-node abliterated VLM for captioning, tagging,
Q&A, prompt-from-image, etc. (CLI: `agent_bridge.py --mode chat --user-prompt "..."`).
Set **`json_output=true`** for JSON-producing system prompts: it extracts the JSON object
from the reply (stripping any reasoning, prose, or code fences) and returns it clean and
re-serialized, falling back to the raw text if no object parses. Works even with `enable_thinking` on.
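For illustration, JSON extraction of this kind can be approximated with the standard library alone. The sketch below is not the node's `_parse_json`; the function name and the choice to keep the *last* object that parses (on the assumption that reasoning comes before the answer) are mine.

```python
import json


def extract_json(reply):
    """Return the last top-level JSON object in reply, re-serialized, else reply."""
    decoder = json.JSONDecoder()
    found = None
    pos = reply.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(reply, pos)
        except json.JSONDecodeError:
            pos = reply.find("{", pos + 1)   # not valid JSON from here, try next brace
            continue
        if isinstance(obj, dict):
            found = obj
        pos = reply.find("{", end)           # continue after the parsed object
    if found is None:
        return reply                         # fallback: raw text
    return json.dumps(found, ensure_ascii=False)
```

Because `raw_decode` parses a complete value starting at a given offset, nested schemas need no brace counting, and surrounding prose or fence markers are simply skipped.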
## Performance / speed
This node runs models through **transformers `.generate()`** — the simplest path, but the
**slowest**: no PagedAttention / continuous batching / fused kernels like vLLM, SGLang, or
llama.cpp. With `enable_thinking` on, the model also emits thousands of reasoning tokens
(each token = one forward pass) — that's the cost of accurate verdicts. Levers, fastest first:
- **`swap_eval = false`** — halves the work (one reasoned pass instead of two). Biggest free win.
- **flash-attention** — the node auto-uses `flash_attention_2` if `flash-attn` is installed, else `sdpa`. `pip install flash-attn` for the speedup.
- **smaller model / fewer axes** — Qwen3.5-9B bf16 over the 27B/35B; trim `axes` or use a focused `profile`.
- **`enable_thinking = false`** — much faster, but reasoning models then rubber-stamp `match`; only for quick smoke tests.
- **avoid `nf4`** for speed — bitsandbytes dequantizes every step; `bf16`/`fp8` decode faster (nf4 is for *fitting* the big models, not speed).
The real fix for production speed is a different inference engine (vLLM/SGLang serve these
models many× faster) — a heavier, separate-server setup not built into this node.
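The flash-attention fallback described above amounts to a check of whether the `flash_attn` package is importable. A sketch, with a hypothetical helper name, whose result would be passed as the `attn_implementation` argument of a transformers `from_pretrained` call:

```python
import importlib.util


def pick_attn_implementation():
    """Prefer FlashAttention 2 when the flash_attn package is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Typical use with transformers (model class and path are left to the caller):
# model = SomeModelClass.from_pretrained(path, attn_implementation=pick_attn_implementation())
print(pick_attn_implementation())
```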
**Outputs**
| name | type | use |
|---|---|---|
| `overall_score` | FLOAT 0..1 | compare: mean verdict (computed here, not by the model). describe: `1.0` placeholder |
| `axis_scores_json` | STRING (JSON) | compare: per-axis `{verdict, ref, gen}` (verdict = match/partial/mismatch). describe: `{axis: value}` |
| `analysis` | STRING | compare: header (`overall, N mismatches`) + axes worst-first (`VERDICT ref:[…] gen:[…]`). describe: the `caption`. chat: the model's response |
| `raw` | STRING | raw model output (both passes if `swap_eval`) |
| `report_path` | STRING | path to the written `calib_<tag>.json` (carries `mismatch_count`) |
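As a rough illustration of how a mean verdict, the mismatch count, and the worst-first ordering could be derived from `axis_scores_json`: the numeric weights (`match` = 1, `partial` = 0.5, `mismatch` = 0) and the way the two `swap_eval` passes are averaged are assumptions, not the node's documented formula.

```python
from statistics import mean

# Assumed numeric value of each verdict; the node's real weights may differ.
VERDICT_VALUE = {"match": 1.0, "partial": 0.5, "mismatch": 0.0}


def summarize(axis_scores, swapped=None):
    """Return (overall_score, mismatch_count, axes sorted worst first)."""
    per_axis = {}
    for axis, entry in axis_scores.items():
        values = [VERDICT_VALUE[entry["verdict"]]]
        if swapped and axis in swapped:      # second pass from swap_eval, if any
            values.append(VERDICT_VALUE[swapped[axis]["verdict"]])
        per_axis[axis] = mean(values)
    overall = mean(list(per_axis.values())) if per_axis else 0.0
    mismatches = sum(1 for e in axis_scores.values() if e["verdict"] == "mismatch")
    worst_first = sorted(per_axis, key=per_axis.get)
    return overall, mismatches, worst_first
```

Computing the score outside the model keeps it deterministic for a given set of verdicts, which is the point of "computed here, not by the model" in the table.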
## Install
```bash
cd /media/p5/Comfyui/custom_nodes
ln -s /media/p5/ComfyUI-Prompt-Calibratror . # or git clone
/media/p5/Comfyui/venv/bin/pip install -r /media/p5/ComfyUI-Prompt-Calibratror/requirements.txt
```
The node defaults to the **huihui-ai Qwen3-VL-4B-Instruct abliterated** weights already
converted at `/media/p5/qwen3vl_4b_abliterated_comfy_convert/` so it runs out of the box
(the abliterated/uncensored variant won't refuse to analyze adult imagery, which would
otherwise break the loop).
**Pick a model in `model_select` and a quant in `precision`.** All are abliterated,
multimodal **safetensors** (transformers), auto-downloaded. The newer **Qwen3.5/3.6** are
natively multimodal (need a recent transformers — they load via `AutoModelForMultimodalLM`).
Approximate VRAM in GB by quant on the RTX 5090 32 GB (✅ fits / ⚠ tight / ❌ does not fit):
| model | bf16 | fp8 | nf4 | note |
|---|---|---|---|---|
| Qwen3-VL-4B (local) | ✅ ~9 | ✅ ~5 | ✅ ~3 | fast, weak |
| Qwen3-VL-8B | ✅ ~17 | ✅ ~9 | ✅ ~6 | solid, fast |
| **Qwen3.5-9B** | ✅ ~20 | ✅ ~10 | ✅ ~7 | **newer, fast — recommended** |
| Qwen3-VL-30B-A3B (MoE) | ❌ ~62 | ⚠ ~31 | ✅ ~18 | nf4 slow |
| Qwen3.6-27B (dense) | ❌ ~56 | ⚠ ~28 | ✅ ~16 | nf4 slow, strong |
| Qwen3.6-35B-A3B (MoE) | ❌ ~70 | ❌ | ✅ ~20 | nf4 slow, top quality |
`nf4` (bitsandbytes) fits the big ones but is **slow** (dequant overhead) — that's the
bottleneck, not the model. `fp8` is fast but only when a real fp8 checkpoint exists (the
local 4B has one; `precision=fp8` on a bf16-only repo won't quantize). For speed + recency,
**Qwen3.5-9B at bf16** is the sweet spot. See
[docs/METHODOLOGY.md](docs/METHODOLOGY.md#model-sizing-on-32-gb-rtx-5090--abliterated-latest-qwen-vl).
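A back-of-the-envelope check of the table: weight memory is roughly parameter count times bytes per parameter (2 for `bf16`, 1 for `fp8`, about 0.5 for `nf4`). The helper below is only that arithmetic; it ignores activations, KV cache, and quantization metadata, which is presumably why the table's figures sit a little higher.

```python
# Bytes per parameter for each quant option (weights only, no runtime overhead).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weights_gb(params_billion, precision):
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, size in [("Qwen3-VL-8B", 8), ("Qwen3.6-27B", 27)]:
    row = {p: weights_gb(size, p) for p in BYTES_PER_PARAM}
    print(name, row)
```

For the 27B dense model this gives 54 GB at `bf16` against the table's ~56, which is why only the quantized variants are candidates on a 32 GB card.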
## Loop sketch
```
Prompt-Builder (SxCP) ──prompt──▶ T2I (SDXL/Flux/Krea2) ──image──▶ Qwen3-VL Image Judge
        ▲                                                                │
        └──────── knob overrides ◀── Controller ◀── overall_score + diff ┘
```
Use the Prompt-Builder **For-Loop Start/End + Accumulator** nodes to drive iterations and
route `overall_score` into the stop condition. Controller options (greedy hill-climb →
black-box optimizer → LLM-in-the-loop) are in the methodology doc.
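The simplest controller option, a greedy hill-climb, can be sketched generically. Everything here is illustrative: `mutate` stands in for whatever changes the Prompt-Builder knobs, `score` stands in for a full generate-and-judge pass returning `overall_score`, and the `target` and `max_iters` defaults are arbitrary.

```python
import random


def hill_climb(start, mutate, score, target=0.9, max_iters=20, rng=None):
    """Greedy loop: keep a mutated candidate only if it scores higher."""
    rng = rng or random.Random(0)
    best, best_score = start, score(start)
    for _ in range(max_iters):
        if best_score >= target:          # stop condition fed by overall_score
            break
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # reject anything that does not improve
            best, best_score = candidate, candidate_score
    return best, best_score
```

Since each `score` call costs a full image generation plus a judge pass, a tight `max_iters` matters more here than in a typical optimizer.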
## End-to-end loop
1. Run ComfyUI with `--listen`, install this node pack, and put your reference at `ComfyUI/input/reference.png`.
2. **First pass (describe):** the judge looks at the reference alone and emits **one canonical
   scene description** (coherent paragraph + per-axis target spec) to seed the prompt *and*
   anchor the loop:
   ```bash
   python agent_bridge.py --mode describe --workflow workflow/workflow_describe_api.json \
     --run-tag seed --analysis-dir /media/p5/Comfyui/output/calibrator
   ```
3. **Compare loop:** load `workflow/workflow_api.json` (an SDXL `waiIllustriousSDXL_v160` example;
   swap the checkpoint for Flux/Krea as needed) and iterate, following `docs/CALIBRATION_POLICY.md`.
   Pass `--ref-desc-file` so compare anchors on the canonical reference (the `ref` side stays
   fixed; only the generated image is re-read each turn):
   ```bash
   python agent_bridge.py --workflow workflow/workflow_api.json \
     --prompt "<description from step 2, then calibrated>" \
     --ref-desc-file /media/p5/Comfyui/output/calibrator/calib_seed.json \
     --run-tag iter001 --analysis-dir /media/p5/Comfyui/output/calibrator
   ```
   stdout = the analysis JSON (`{verdict, ref, gen}` per axis) → agent steers toward `ref` → next iteration.
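An external agent written in Python could wrap the compare call along these lines. The flags are the ones used in the steps above; the helper names are hypothetical, and the sketch assumes stdout contains nothing but the analysis JSON.

```python
import json
import subprocess
import sys


def build_command(prompt, run_tag, ref_desc_file, analysis_dir,
                  workflow="workflow/workflow_api.json"):
    """Assemble the agent_bridge.py call for one compare iteration."""
    return [sys.executable, "agent_bridge.py",
            "--workflow", workflow,
            "--prompt", prompt,
            "--ref-desc-file", ref_desc_file,
            "--run-tag", run_tag,
            "--analysis-dir", analysis_dir]


def run_iteration(prompt, index, ref_desc_file, analysis_dir):
    """Run one iteration and return the parsed analysis JSON from stdout."""
    cmd = build_command(prompt, f"iter{index:03d}", ref_desc_file, analysis_dir)
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)        # assumes stdout is pure JSON
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or commas.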
## Status
- [x] Methodology + node selection (`docs/METHODOLOGY.md`)
- [x] Qwen3-VL Image Judge node — `describe` (first pass) + `compare` (scoring), swap-eval, file report
- [x] Agent-driven architecture (`docs/AGENT_LOOP.md`) — Receptor node + `agent_bridge.py` (`--mode`)
- [x] Example workflows: `workflow_describe_api.json` (first pass) + `workflow_api.json` (compare loop)
- [x] Agent calibration policy (`docs/CALIBRATION_POLICY.md`)
- [ ] Optional: structured-config receptor (carry Prompt-Builder knobs instead of a flat string)