commit 95198a15b56025e734eb9c9dab616f1db6560e91 Author: Ethanfel Date: Fri Jun 26 22:15:56 2026 +0200 Initial commit: VLM-as-judge prompt calibration loop Qwen3-VL image-similarity judge node, external-prompt receptor node, agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/ calibration-policy docs. Co-Authored-By: Claude Opus 4.8 diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..ffda33d --- /dev/null +++ b/.gitignore @@ -0,0 +1,8 @@ +__pycache__/ +*.pyc +output/ +models/ +*.safetensors +*.gguf +.DS_Store +.venv/ diff --git a/README.md b/README.md new file mode 100644 index 0000000..ff6b214 --- /dev/null +++ b/README.md @@ -0,0 +1,110 @@ +# ComfyUI-Prompt-Calibratror + +A **fully local** prompt calibration loop for ComfyUI. A vision-language model +(Qwen3-VL) judges how close a *generated* image is to a *reference* image and +returns a structured score + per-axis difference analysis, which is used to +**calibrate the prompt-generation method** ([ComfyUI-Prompt-Builder](../ComfyUI-Prompt-Builder)) +until the generated image matches the reference. + +> Full design rationale, controller options, and VLM-as-judge variance mitigations +> are in **[docs/METHODOLOGY.md](docs/METHODOLOGY.md)**. The controller is an **external +> CLI agent** that drives ComfyUI via its HTTP API — see **[docs/AGENT_LOOP.md](docs/AGENT_LOOP.md)**. + +## Nodes & tools + +| Component | What it is | +|---|---| +| `Qwen3-VL Image Judge (Calibrator)` | scores generated vs reference, writes analysis to disk for the agent | +| `SxCP External Prompt (Receptor)` | stable injection point; the agent sets `prompt/negative/seed` here per queue | +| `agent_bridge.py` | one CLI call = one iteration (inject → `POST /prompt` → wait → print analysis JSON) | + +## The "vllm node": `Qwen3-VL Image Judge (Calibrator)` + +The core node (`nodes/qwen_judge.py`). 
It reuses the standard transformers Qwen3-VL +inference plumbing (same approach as +[ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage) +— the recommended reuse base) but **forces strict JSON output** so an automated loop +can act on it. + +**Inputs** + +| name | type | default | notes | +|---|---|---|---| +| `reference_image` | IMAGE | — | the target | +| `generated_image` | IMAGE | — | the candidate to score | +| `model_path` | STRING | `/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16` | local dir, **HF repo id** (`org/name`), or alias (`30b-a3b` / `8b` / `4b`) | +| `precision` | bf16 / fp16 / fp8 / nf4 | bf16 | `nf4` = 4-bit (run the 30B judge on 32 GB); `fp8` with the `hf_fp8` copy | +| `axes` | STRING | cast, clothing, pose, scene, composition, expression, color_light | scored axes (match your Prompt-Builder knobs) | +| `max_new_tokens` | INT | 512 | | +| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable | +| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias | +| `keep_loaded` | BOOL | true | cache weights across loop iterations | +| `auto_download` | BOOL | true | if `model_path` is a repo id/alias and not local, fetch it from HF into `models/prompt_generator/` | + +**Auto-download:** set `model_path` to `30b-a3b` (alias) or any `org/name` repo id and leave +`auto_download` on — the node snapshot-downloads it on first run (into ComfyUI's +`models/prompt_generator/`) and reuses the local copy afterward. Local paths and the +default skip download entirely. 
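The `model_path` handling described above (local directory, alias, or `org/name` repo id) can be sketched as a small planning function. This is only an illustration, not the node's actual code: `plan_model_source` and the `models_dir/<name>` target layout are hypothetical, and only the alias table is taken from `RECOMMENDED_MODELS` in `nodes/qwen_judge.py`. The function decides what to do without touching the network; a real download step would then call something like `huggingface_hub.snapshot_download(repo_id=..., local_dir=...)`.

```python
import os

# Alias table copied from RECOMMENDED_MODELS in nodes/qwen_judge.py.
ALIASES = {
    "30b-a3b": "huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated",
    "8b": "huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated",
    "4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated",
}


def plan_model_source(model_path, models_dir, auto_download=True):
    """Return (kind, value) where kind is 'local', 'download' or 'missing'.

    Hypothetical helper: the target layout models_dir/<repo name> is an
    assumption made for this sketch.
    """
    # 1. An existing directory always wins and never triggers a download.
    if os.path.isdir(model_path):
        return "local", model_path

    # 2. Expand an alias, otherwise treat the string as a candidate repo id.
    repo_id = ALIASES.get(model_path.strip().lower(), model_path)

    # 3. A repo id has the shape org/name: exactly two non-empty parts.
    #    (A missing relative path like "a/b" is indistinguishable from one.)
    parts = repo_id.split("/")
    if not (len(parts) == 2 and all(parts)):
        return "missing", model_path

    # 4. Reuse an earlier download if the target directory already exists.
    target = os.path.join(models_dir, parts[1])
    if os.path.isdir(target):
        return "local", target
    if auto_download:
        return "download", (repo_id, target)
    return "missing", repo_id
```

With `auto_download` off, an alias that has not been fetched yet is reported as `missing` instead of triggering a download, which mirrors the switch in the inputs table.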
+ +**Outputs** + +| name | type | use | +|---|---|---| +| `overall_score` | FLOAT 0..1 | loop stop-condition / objective | +| `axis_scores_json` | STRING (JSON) | per-axis `{score, diff}` for the controller | +| `diff_analysis` | STRING | human/controller-readable summary + fix suggestions | +| `raw` | STRING | raw model output (both passes if `swap_eval`) | + +## Install + +```bash +cd /media/p5/Comfyui/custom_nodes +ln -s /media/p5/ComfyUI-Prompt-Calibratror . # or git clone +/media/p5/Comfyui/venv/bin/pip install -r /media/p5/ComfyUI-Prompt-Calibratror/requirements.txt +``` + +The node defaults to the **huihui-ai Qwen3-VL-4B-Instruct abliterated** weights already +converted at `/media/p5/qwen3vl_4b_abliterated_comfy_convert/` so it runs out of the box +(the abliterated/uncensored variant won't refuse to analyze adult imagery, which would +otherwise break the loop). + +**Recommended upgrade (latest Qwen VL + uncensored, fits 32 GB):** +[`huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated`](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated) +— MoE (3B active, fast), run at `precision=nf4` (~18 GB). The node auto-detects the MoE +class. An easier middle ground is the **8B** abliterated at `bf16` (~17 GB, no quantization). +Qwen3.5-VL abliterated isn't out yet (Qwen3.5 abliterated builds are text-only so far); +Gemma-3-27B-it abliterated (4-bit) is a viable non-Qwen alternative. See +[docs/METHODOLOGY.md](docs/METHODOLOGY.md#model-sizing-on-32-gb-rtx-5090--abliterated-latest-qwen-vl). + +## Loop sketch + +``` +Prompt-Builder (SxCP) ──prompt──▶ T2I (SDXL/Flux/Krea2) ──image──▶ Qwen3-VL Image Judge + ▲ │ + └──────── knob overrides ◀── Controller ◀── overall_score + diff ┘ +``` + +Use the Prompt-Builder **For-Loop Start/End + Accumulator** nodes to drive iterations and +route `overall_score` into the stop condition. Controller options (greedy hill-climb → +black-box optimizer → LLM-in-the-loop) are in the methodology doc. 
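A controller usually wants the weakest axis first. The sketch below shows one way to read the `axis_scores_json` output for that purpose. It assumes the string holds an object mapping each axis name to `{score, diff}`, as in the example report in `docs/AGENT_LOOP.md`; `worst_axes` is a hypothetical helper, not part of this node pack.

```python
import json


def worst_axes(axis_scores_json, k=1):
    """Return the k lowest-scoring axes as (axis, score, diff) tuples."""
    axes = json.loads(axis_scores_json)
    rows = [
        (name, float(entry.get("score", 0.0)), entry.get("diff", ""))
        for name, entry in axes.items()
    ]
    # sorted() is stable, so ties keep the order the judge reported them in.
    return sorted(rows, key=lambda row: row[1])[:k]
```

The `diff` text of the returned axis is what the controller would turn into a knob override.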
+ +## End-to-end loop + +1. Run ComfyUI with `--listen`, install this node pack, put your reference at `ComfyUI/input/reference.png`. +2. Load `workflow/workflow_api.json` (SDXL `waiIllustriousSDXL_v160` example — swap the checkpoint for Flux/Krea as needed). +3. Drive it from your agent following `docs/CALIBRATION_POLICY.md`: + ```bash + python agent_bridge.py --workflow workflow/workflow_api.json \ + --prompt "1 woman, red lingerie, bedroom, full body, warm light" \ + --run-tag iter001 --analysis-dir /media/p5/Comfyui/output/calibrator + ``` + stdout = the analysis JSON → agent calibrates → next iteration. + +## Status + +- [x] Methodology + node selection (`docs/METHODOLOGY.md`) +- [x] Qwen3-VL Image Judge node (structured JSON scoring, swap-eval, model caching, file report) +- [x] Agent-driven architecture (`docs/AGENT_LOOP.md`) — Receptor node + `agent_bridge.py` +- [x] Example end-to-end workflow (`workflow/workflow_api.json`) +- [x] Agent calibration policy (`docs/CALIBRATION_POLICY.md`) +- [ ] Optional: structured-config receptor (carry Prompt-Builder knobs instead of a flat string) diff --git a/__init__.py b/__init__.py new file mode 100644 index 0000000..291c07e --- /dev/null +++ b/__init__.py @@ -0,0 +1,15 @@ +"""ComfyUI-Prompt-Calibratror — VLM-as-judge prompt calibration loop.""" + +from .nodes.qwen_judge import ( + NODE_CLASS_MAPPINGS as _JUDGE_CLASSES, + NODE_DISPLAY_NAME_MAPPINGS as _JUDGE_NAMES, +) +from .nodes.receptor import ( + NODE_CLASS_MAPPINGS as _RECEPTOR_CLASSES, + NODE_DISPLAY_NAME_MAPPINGS as _RECEPTOR_NAMES, +) + +NODE_CLASS_MAPPINGS = {**_JUDGE_CLASSES, **_RECEPTOR_CLASSES} +NODE_DISPLAY_NAME_MAPPINGS = {**_JUDGE_NAMES, **_RECEPTOR_NAMES} + +__all__ = ["NODE_CLASS_MAPPINGS", "NODE_DISPLAY_NAME_MAPPINGS"] diff --git a/agent_bridge.py b/agent_bridge.py new file mode 100644 index 0000000..64898d2 --- /dev/null +++ b/agent_bridge.py @@ -0,0 +1,146 @@ +#!/usr/bin/env python3 +""" +agent_bridge.py — drive one calibration iteration from a CLI 
agent. + +The external agent (controller/brain) calls this once per loop step: + + python agent_bridge.py \ + --workflow workflow_api.json \ + --prompt "1 woman, red lingerie, bedroom, full body, warm light" \ + --run-tag iter003 \ + --analysis-dir /path/to/ComfyUI/output/calibrator + +It injects the prompt into the `CalibratorPromptReceptor` node, queues the graph +on a running ComfyUI (`POST /prompt`), waits for completion (`GET /history/{id}`), +then prints the Qwen3-VL Judge's analysis JSON to stdout for the agent to read. + +Stdlib only — no third-party deps, so any agent can shell out to it. + +Loop, from the agent's side: + 1. build a prompt (calibrate from the previous analysis) + 2. run this script -> capture stdout (the analysis JSON) + 3. read overall_score + per-axis diffs + fix_suggestions + 4. adjust the prompt and go to 1, until overall_score >= target +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +import time +import urllib.error +import urllib.request +import uuid + +RECEPTOR_CLASS = "CalibratorPromptReceptor" +JUDGE_CLASS = "QwenVLImageJudge" + + +def _http_json(url: str, payload: dict | None = None, timeout: int = 30): + data = json.dumps(payload).encode("utf-8") if payload is not None else None + req = urllib.request.Request( + url, data=data, headers={"Content-Type": "application/json"} if data else {}) + with urllib.request.urlopen(req, timeout=timeout) as resp: + body = resp.read().decode("utf-8") + return json.loads(body) if body else {} + + +def _inject(graph: dict, prompt: str, negative: str, seed: int, run_tag: str): + """Set the receptor's prompt/negative/seed and the judge's run_tag in-place.""" + found_receptor = False + for node in graph.values(): + ctype = node.get("class_type") + inputs = node.setdefault("inputs", {}) + if ctype == RECEPTOR_CLASS: + inputs["prompt"] = prompt + inputs["negative"] = negative + inputs["seed"] = int(seed) + found_receptor = True + elif ctype == 
JUDGE_CLASS: + inputs["run_tag"] = run_tag + inputs["prompt_used"] = prompt + if not found_receptor: + raise SystemExit( + f"[agent_bridge] no '{RECEPTOR_CLASS}' node in the workflow — add the " + f"'SxCP External Prompt (Receptor)' node and feed the sampler from it.") + + +def _wait_for_history(server: str, prompt_id: str, timeout: int): + deadline = time.time() + timeout + while time.time() < deadline: + hist = _http_json(f"http://{server}/history/{prompt_id}") + if prompt_id in hist: + entry = hist[prompt_id] + status = entry.get("status", {}) + # ComfyUI marks completed=True (or status_str) when the run is done. + if status.get("completed", True): + return entry + time.sleep(1.0) + raise SystemExit(f"[agent_bridge] timed out after {timeout}s waiting for {prompt_id}") + + +def _read_report(analysis_file: str, analysis_dir: str, run_tag: str): + candidates = [] + if analysis_file: + candidates.append(analysis_file) + if analysis_dir: + if run_tag: + safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in run_tag) + candidates.append(os.path.join(analysis_dir, f"calib_{safe}.json")) + candidates.append(os.path.join(analysis_dir, "latest.json")) + for path in candidates: + if os.path.isfile(path): + with open(path, "r", encoding="utf-8") as f: + return json.load(f), path + return None, None + + +def main(argv=None): + ap = argparse.ArgumentParser(description="Drive one ComfyUI calibration iteration.") + ap.add_argument("--server", default="127.0.0.1:8188") + ap.add_argument("--workflow", required=True, help="API-format workflow JSON") + ap.add_argument("--prompt", required=True) + ap.add_argument("--negative", default="") + ap.add_argument("--seed", type=int, default=0) + ap.add_argument("--run-tag", default="") + ap.add_argument("--analysis-file", default="", + help="explicit path to the report JSON the Judge writes") + ap.add_argument("--analysis-dir", default="", + help="dir holding calib_<run_tag>.json / latest.json (Judge report_dir)") + 
ap.add_argument("--timeout", type=int, default=600) + args = ap.parse_args(argv) + + with open(args.workflow, "r", encoding="utf-8") as f: + graph = json.load(f) + + _inject(graph, args.prompt, args.negative, args.seed, args.run_tag) + + client_id = uuid.uuid4().hex + try: + queued = _http_json(f"http://{args.server}/prompt", + {"prompt": graph, "client_id": client_id}) + except urllib.error.URLError as e: + raise SystemExit(f"[agent_bridge] cannot reach ComfyUI at {args.server}: {e}") + prompt_id = queued.get("prompt_id") + if not prompt_id: + raise SystemExit(f"[agent_bridge] queue rejected: {json.dumps(queued)[:400]}") + + _wait_for_history(args.server, prompt_id, args.timeout) + + report, path = _read_report(args.analysis_file, args.analysis_dir, args.run_tag) + if report is None: + raise SystemExit( + "[agent_bridge] run finished but no report file found. Set the Judge " + "node's report_dir and pass --analysis-dir (or --analysis-file).") + + report["_prompt_id"] = prompt_id + report["_report_path"] = path + json.dump(report, sys.stdout, ensure_ascii=False, indent=2) + sys.stdout.write("\n") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/docs/AGENT_LOOP.md b/docs/AGENT_LOOP.md new file mode 100644 index 0000000..cc70c00 --- /dev/null +++ b/docs/AGENT_LOOP.md @@ -0,0 +1,87 @@ +# Agent-driven calibration loop + +The controller is an **external CLI agent**, not an in-graph node. ComfyUI is the +execution environment (prompt receptor → T2I → VLM judge); the agent is the brain that +reads the analysis, calibrates the prompt generator, and queues the next iteration. + +``` + CLI AGENT (controller / brain) COMFYUI (execution, running with --listen) + ─────────────────────────────── ────────────────────────────────────────── + 1. build/calibrate a prompt + 2. agent_bridge.py --prompt ... 
───POST /prompt──► CalibratorPromptReceptor (injection point) + │ prompt / negative / seed + ▼ + T2I (SDXL / Flux / Krea2) + │ generated image + ▼ + Qwen3-VL Image Judge + │ writes calib_.json + latest.json + 3. poll /history/{id} (bridge does this) ◄───────────┘ + 4. read report JSON (overall_score, + per-axis diffs, fix_suggestions) + 5. adjust Prompt-Builder knobs / prompt + └──► go to 1 until overall_score ≥ target +``` + +## Why API-driven, not file-watch + +A passive "watch a file and auto-run" receptor is fragile in ComfyUI (no native file +watcher / auto-queue, and prompt↔image↔analysis can desync). Driving `POST /prompt` +instead makes every iteration **synchronous and ordered** — one `prompt_id` ties the +prompt, the image, and the analysis together. The receptor node is still the clean +injection point; the agent just overrides its widgets per queue. (The receptor *also* +supports a `source_file` for file-first workflows if you ever want it.) + +## The three pieces + +| Piece | Role | +|---|---| +| `CalibratorPromptReceptor` (`SxCP External Prompt (Receptor)`) | Stable node the agent injects `prompt/negative/seed` into. Feeds the sampler. | +| `QwenVLImageJudge` (`Qwen3-VL Image Judge (Calibrator)`) | Scores generated vs reference; writes `calib_.json`, `latest.json`, `calib_.md` to `report_dir`. | +| `agent_bridge.py` | One CLI call = one iteration: inject prompt → queue → wait → print the analysis JSON to stdout. Stdlib only. 
| + +## One iteration (what the agent runs) + +```bash +python agent_bridge.py \ + --server 127.0.0.1:8188 \ + --workflow workflow_api.json \ + --prompt "1 woman, red lingerie, bedroom, full body, warm rim light" \ + --negative "blurry, deformed" \ + --seed 12345 \ + --run-tag iter003 \ + --analysis-dir /media/p5/Comfyui/output/calibrator +``` + +Stdout (captured by the agent) is the report: + +```json +{ + "run_tag": "iter003", + "overall_score": 0.62, + "axes": { + "pose": {"score": 0.40, "diff": "ref standing, gen seated"}, + "clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"} + }, + "fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"], + "prompt_used": "1 woman, red lingerie, ...", + "_prompt_id": "…", "_report_path": "…/calib_iter003.json" +} +``` + +## Agent calibration policy (suggested) + +The agent maps the lowest-scoring axes onto Prompt-Builder knobs and applies the +`fix_suggestions`, regenerates, and keeps changes that raise `overall_score` +(greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so +the score reflects the prompt, not sampler noise; vary the seed only once you're near the +target. Stop at `overall_score ≥ target` (e.g. 0.85) or a max-iteration budget. Log every +`(prompt, knobs, score)` so the search is auditable/resumable. + +## Setup checklist + +1. Run ComfyUI with `--listen` (so the bridge can POST). Install this node pack. +2. Build a workflow with: `CalibratorPromptReceptor` → (Prompt-Builder formatting, optional) → T2I → `QwenVLImageJudge` (feed the **reference** image into `reference_image`, the T2I output into `generated_image`). +3. Set the Judge's `report_dir` to a known path; pass the same path as `--analysis-dir`. +4. Export the workflow in **API format** (`workflow_api.json`). +5. Drive it from the agent with `agent_bridge.py`, once per iteration. 
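From the agent's side, step 5 of the checklist amounts to shelling out and parsing stdout. The wrapper below is a minimal sketch of that call. The flags are the ones `agent_bridge.py` defines, while `build_bridge_cmd` and `run_iteration` are hypothetical names introduced here for illustration.

```python
import json
import subprocess
import sys


def build_bridge_cmd(workflow, prompt, run_tag, analysis_dir,
                     negative="", seed=0, server="127.0.0.1:8188",
                     script="agent_bridge.py"):
    """Assemble argv for one agent_bridge.py call (a list, so no shell quoting)."""
    return [
        sys.executable, script,
        "--server", server,
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(int(seed)),
        "--run-tag", run_tag,
        "--analysis-dir", analysis_dir,
    ]


def run_iteration(**kwargs):
    """Run one iteration and return the report dict printed on stdout."""
    proc = subprocess.run(build_bridge_cmd(**kwargs),
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # agent_bridge.py exits via SystemExit(message), which lands on stderr.
        raise RuntimeError(proc.stderr.strip() or "agent_bridge.py failed")
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids quoting problems when the prompt contains commas, quotes, or shell metacharacters.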
diff --git a/docs/CALIBRATION_POLICY.md b/docs/CALIBRATION_POLICY.md new file mode 100644 index 0000000..04d69e9 --- /dev/null +++ b/docs/CALIBRATION_POLICY.md @@ -0,0 +1,135 @@ +# Calibration policy — the agent's playbook + +This is the instruction set the **external CLI agent** (the controller) follows each +iteration. Paste the "Agent system prompt" block into your agent, give it the workflow +path + reference image + target score, and let it loop. + +The agent calibrates by reasoning over the **Prompt‑Builder axes** and editing a +structured *axis state*, then **rendering that state to a prompt string** that it injects +into the `CalibratorPromptReceptor`. This keeps the reasoning axis‑aware while staying +compatible with the flat‑string receptor. (If you later switch the receptor to carry a +structured config, the same axis state maps straight onto Prompt‑Builder's split control +nodes.) + +--- + +## Axis state (the agent's working memory) + +```json +{ + "cast": "1 woman, mid-20s, athletic", + "clothing": "red lace lingerie", + "pose": "standing, hand on hip", + "scene": "dimly lit bedroom", + "composition": "full-body shot, slight low angle", + "expression": "soft smile, eye contact", + "color_light": "warm rim light, shallow depth of field", + "quality": "photorealistic, high detail", + "negative": "blurry, deformed, lowres, extra limbs", + "seed": 12345 +} +``` + +These keys are exactly the Judge's scoring axes. `quality`/`negative`/`seed` are carried +but not scored. 
Render order (subject → wardrobe → action → setting → framing → affect → +light → quality): + +``` +prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality]) +``` + +--- + +## Per‑iteration algorithm (greedy per‑axis hill‑climb) + +``` +best_score = -1 ; best_state = initial_state ; stale = 0 ; i = 0 +loop: + i += 1 + prompt = render(state) + report = run agent_bridge.py --prompt prompt --negative state.negative + --seed state.seed --run-tag iter{i} + --workflow wf.json --analysis-dir + score = report.overall_score + if score >= TARGET: # e.g. 0.85 + stop("converged", state, score) + if score > best_score: + best_score = score ; best_state = state ; stale = 0 + else: + stale += 1 + state = best_state # revert: undo the change that didn't help + if stale >= PATIENCE or i >= MAX_ITERS: # e.g. PATIENCE=4, MAX_ITERS=25 + stop("plateau/budget", best_state, best_score) + + # choose the next single edit: + worst_axis = axis with lowest per-axis score in report.axes + edit = map_fix_to_axis(report.fix_suggestions, worst_axis) # apply the model's suggestion + state = apply(best_state, worst_axis, edit) # change ONE axis only +``` + +### Rules that matter + +1. **Change one axis per iteration.** One edit = clean attribution of the score delta. + Only batch two edits when two axes score very low *and* are clearly independent. +2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not + sampler noise. Vary the seed only after you've converged, to confirm robustness. +3. **Always edit from `best_state`, not the last (possibly worse) state** — that's the + "revert on no improvement" step. Prevents drifting down a bad path. +4. **Target the lowest‑scoring axis first**, applying the Judge's matching + `fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value + for that axis before moving on. +5. 
**Near the margin, don't over‑trust one reading.** `swap_eval` already averages two + orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare + averages before committing. +6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size + (smaller, more specific wording changes) and re‑anchor on `best_state`. +7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per‑axis)`. + The run must be auditable and resumable. + +### Mapping `fix_suggestions` → axes + +The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to +clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous, +attribute it to the lowest‑scoring axis it plausibly affects. + +--- + +## Worked example + +``` +iter1 prompt="1 woman, casual outfit, indoors, ..." score=0.41 + axes: scene 0.30 (worst) — "ref bedroom, gen kitchen" + fix: "set scene to a dim bedroom" +iter2 edit scene→"dimly lit bedroom" score=0.58 (kept) + axes: pose 0.35 (worst) — "ref standing, gen seated" +iter3 edit pose→"standing, hand on hip" score=0.71 (kept) + axes: color_light 0.50 (worst) — "ref warm, gen flat" +iter4 edit color_light→"warm rim light" score=0.69 (worse → revert) +iter5 edit color_light→"warm golden hour glow" score=0.83 (kept) + axes: clothing 0.78 (worst) — "gen lacks lace detail" +iter6 edit clothing→"red lace lingerie with trim" score=0.88 ≥ target → STOP +``` + +--- + +## Agent system prompt (paste into your CLI agent) + +> You are the controller for a local image prompt calibrator. Goal: make a generated +> image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes +> (cast, clothing, pose, scene, composition, expression, color_light) from 0–1. +> +> You hold an **axis state** (JSON, keys above). 
Each turn you: (1) render the state to a +> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→ +> quality; (2) run `python agent_bridge.py --workflow --prompt "" +> --negative "" --seed --run-tag iter --analysis-dir +> `; (3) read the printed JSON report. +> +> Then apply greedy per‑axis hill‑climb: keep the change only if `overall_score` improved, +> else revert to the best state; pick the **lowest‑scoring axis** and apply the Judge's +> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching. +> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non‑improving +> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score. +> +> Never change more than one axis at a time unless two axes are both very low and clearly +> independent. Never trust a single near‑margin reading — re‑run on a second seed when two +> candidates are within 0.03. diff --git a/docs/METHODOLOGY.md b/docs/METHODOLOGY.md new file mode 100644 index 0000000..3a70bd1 --- /dev/null +++ b/docs/METHODOLOGY.md @@ -0,0 +1,198 @@ +# Local Prompt Calibrator — Methodology + +> Goal: a **fully local** ComfyUI feedback loop where a vision‑language model (VLM) +> scores how close a *generated* image is to a *reference* image, and that score + +> a structured difference analysis is used to **calibrate the prompt‑generation +> method** ([ComfyUI‑Prompt‑Builder](../../ComfyUI-Prompt-Builder), the "SxCP" nodes) +> until the generated image matches the reference. + +--- + +## 1. 
The loop at a glance + +``` + ┌──────────────────────────────────────────────┐ + │ REFERENCE image (the target look) │ + └───────────────┬──────────────────────────────┘ + │ + ┌────────────────────▼────────────────┐ calibration deltas + │ Prompt-Builder (SxCP) ── "method" │◄──── (axis nudges / knob + │ seeded pools + profile knobs │ overrides / seed move) + └────────────────────┬────────────────┘ + │ prompt + negative + ┌────────────────────▼────────────────┐ + │ T2I model (SDXL / Flux / Krea2) │ ← fix the sampler seed while + └────────────────────┬────────────────┘ searching the prompt axes + │ generated image + ┌────────────────────▼──────────────────────────────────┐ + │ Qwen3-VL JUDGE node ── the "vllm node" │ + │ in : reference + generated │ + │ out: overall_score 0..1 │ + │ per-axis scores (cast, clothing, pose, scene, │ + │ composition, expression, color/lighting) │ + │ diff_analysis (JSON: what's off + how to fix, │ + │ phrased in Prompt-Builder axis vocabulary) │ + └────────────────────┬──────────────────────────────────┘ + │ score + diffs + ┌────────────────────▼────────────────┐ + │ CALIBRATOR / controller │ + │ - accumulate per-axis scores │ + │ - map diffs → axis adjustments │ + │ - update Prompt-Builder knobs │ + │ - stop when overall_score ≥ target │ + │ or max iterations reached │ + └──────────────────────────────────────┘ +``` + +The novel piece is the **Judge node**. Off‑the‑shelf Qwen‑VL nodes emit free text; +a calibrator needs a **machine‑readable score + per‑axis diffs** so the controller +can act on them. That is what `nodes/qwen_judge.py` in this repo provides. + +--- + +## 2. 
The VLLM node — what to reuse + +You already have the model converted locally: + +``` +/media/p5/qwen3vl_4b_abliterated_comfy_convert/ + ├── hf_bf16/ ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16 + └── hf_fp8/ ← same model, FP8 (≈4–5 GB, trivially fits the RTX 5090 32 GB) +``` + +The **abliterated** variant matters: stock Qwen3‑VL will often refuse to "describe or +analyze" adult imagery, which would break the loop. huihui‑ai removed the text‑side +refusal direction, so it scores NSFW reference/generated pairs without bailing. + +### Reusable ComfyUI nodes (pick one as the plumbing base) + +| Repo | Backend | Multi‑image | Local path | Notes | +|---|---|---|---|---| +| **[hardik-uppal/ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)** | transformers | ✅ `images` + `images_batch_2/3` | needs tiny tweak | **Best base** — built for "compare these images, describe the differences"; supports FP16 / 8‑bit / 4‑bit **and pre‑quantized FP8** (matches your `hf_fp8`). | +| [IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct](https://github.com/IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct) | transformers | ✅ multi‑image query | HF download | Clean native Qwen3‑VL‑Instruct integration. | +| [jren712/ComfyUI-QwenVL-abliterated](https://github.com/jren712/ComfyUI-QwenVL-abliterated) | transformers | ✅ | abliterated‑oriented | Fork tuned for the abliterated weights. | +| [1038lab/ComfyUI-QwenVL](https://github.com/1038lab/ComfyUI-QwenVL) | **GGUF** (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. | + +**Recommendation:** don't run any of them *as‑is* for the loop — they only output text. +Instead reuse their **model‑load + `apply_chat_template` + `generate`** plumbing inside +a purpose‑built **Judge node** (this repo) that forces structured JSON output. The +`ComfyUI-QwenVL-MultiImage` loader is the closest template (it already handles two +image batches + FP8). 
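The "force structured JSON" part of that recommendation can be sketched without loading a model. The example below builds a two-image chat message of the kind a Qwen-VL processor's `apply_chat_template` consumes, and then recovers a JSON object from a possibly chatty reply. The instruction wording, `build_judge_messages`, and `parse_judge_json` are assumptions made for illustration and may differ from what `nodes/qwen_judge.py` actually does.

```python
import json
import re


def build_judge_messages(reference, generated, axes):
    """Build one user turn holding both images plus a JSON-only instruction."""
    axis_list = [a.strip() for a in axes.split(",") if a.strip()]
    # Show the model the exact shape it must fill in.
    schema = {
        "overall_score": 0.0,
        "axes": {a: {"score": 0.0, "diff": ""} for a in axis_list},
        "fix_suggestions": [],
    }
    instruction = (
        "Image 1 is the REFERENCE, image 2 is the GENERATED candidate. "
        "Score how closely image 2 matches image 1 on each axis from 0 to 1. "
        "Reply with JSON only, using exactly this shape: " + json.dumps(schema)
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": reference},
            {"type": "image", "image": generated},
            {"type": "text", "text": instruction},
        ],
    }]


def parse_judge_json(text):
    """Return the JSON object embedded in the reply, or None if unparseable."""
    # Greedy match from the first '{' to the last '}' so code fences or a
    # leading sentence around the object are ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

Returning `None` on a bad reply lets the caller decide whether to retry the pass or fall back to a neutral score.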
+ +### Model sizing on 32 GB (RTX 5090) — abliterated, latest Qwen VL + +As of June 2026 the **latest Qwen VL family is Qwen3‑VL** (Qwen3.5‑VL shipped early +2026, but abliterated builds of it are **text‑only so far** — no uncensored +Qwen3.5‑*VL* yet). So "latest + uncensored + fits 32 GB" = **Qwen3‑VL‑30B‑A3B abliterated**. +All rows below are huihui‑ai abliterated (uncensored) weights: + +| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict | +|---|---|---|---| +| **Qwen3‑VL‑30B‑A3B‑Instruct** ([HF](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated)) | **nf4 (4‑bit)** or GGUF Q4_K_M | ~18 GB | **Best judge that fits.** MoE → only 3B active, so it's fast despite 30B total. transformers class `Qwen3VLMoeForConditionalGeneration` (auto‑detected by the node). | +| Qwen3‑VL‑8B‑Instruct ([HF](https://huggingface.co/huihui-ai)) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; drop‑in for the judge node. | +| Qwen3‑VL‑4B‑Instruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. | + +**Gemma alternative:** Gemma‑3‑27B‑it (abliterated, 4‑bit ~16 GB) is a solid different +visual prior if you want a second opinion, but the Krea2 text encoder + Prompt‑Builder +are already Qwen‑aligned, so staying on Qwen3‑VL keeps the vocabulary consistent. + +Download an upgrade and point the node's `model_path` at it: +```bash +hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \ + --local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated +# then in the Judge node: model_path=, precision=nf4 +``` + +Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always +co‑reside — run them as **separate queue steps** and let ComfyUI unload between; the loop +is sequential anyway. The 8B bf16 judge co‑resides more easily. + +--- + +## 3. 
Scoring rubric (what the VLM actually returns) + +The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and a score +per axis, where the axes mirror what Prompt‑Builder can control. This is what makes the +diff *actionable* instead of generic prose. + +```json +{ + "overall_score": 0.0, + "axes": { + "cast": {"score": 0.0, "diff": "ref has 1 woman, gen has 2"}, + "clothing": {"score": 0.0, "diff": "ref lingerie vs gen nude"}, + "pose": {"score": 0.0, "diff": "ref standing vs gen seated"}, + "scene": {"score": 0.0, "diff": "ref bedroom vs gen outdoor"}, + "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"}, + "expression": {"score": 0.0, "diff": "ref smiling vs gen neutral"}, + "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"} + }, + "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...] +} +``` + +The axis list is **configurable** on the node so it can match whichever Prompt‑Builder +knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression, +color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can +map each one onto a knob. + +### Reducing VLM‑as‑judge variance (important) + +VLM scoring is noisy and biased. Mitigations baked into the node / recommended: + +1. **Position‑bias swap** — run the judge twice with reference/generated order swapped and + average the per‑axis scores (`swap_eval=True`). Cuts the "first image wins" bias. +2. **Low temperature** (0.0–0.3) + a **fixed rubric** in the system prompt → repeatable scores. +3. **Anchored 0–1 rubric** (0 = unrelated, 0.5 = same category/different details, 1 = near‑identical) so scores are comparable across iterations. +4. **Evidence‑first**: ask the model to state the concrete difference *before* the number; reasoning‑then‑score is measurably more reliable than score‑then‑reasoning. +5. 
**Average over k T2I seeds** for the *same* prompt if you want the score to reflect the prompt rather than sampler noise — or, cheaper, **freeze the T2I seed** during the axis search and only vary it once at the end. + +--- + +## 4. The calibrator / controller + +> **Chosen design: the controller is an external CLI agent, not an in‑graph node.** +> The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into +> the `CalibratorPromptReceptor` node, and queues ComfyUI via its HTTP API — one +> `prompt_id` per iteration. See **[AGENT_LOOP.md](AGENT_LOOP.md)** and `agent_bridge.py`. +> The options below describe the *policy* the agent can run. + +Prompt‑Builder is a **deterministic, seeded, combinatorial** generator (it is *not* an +LLM). So "calibration" = **searching the space of `(seed, profile, per‑axis overrides)`** +to maximize `overall_score`. Three controller options, easiest → strongest: + +1. **Greedy per‑axis hill‑climb (start here).** + For each axis with the lowest score, apply the matching `fix_suggestion` as a knob + override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, keep the change + if `overall_score` improved, else revert. Loop until ≥ target or no axis improves. + Implementable today with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes. + +2. **Black‑box optimizer over the knob vector.** + Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES / + a simple bandit, objective = `overall_score`. Better for >3–4 interacting axes; needs + a thin Python controller node that holds state across iterations. + +3. **LLM‑in‑the‑loop rewriter.** + Feed `diff_analysis` to a (local) text LLM that proposes the next knob settings (or, + if you move to free‑text prompts, rewrites the prompt). Most flexible, least + reproducible — use the same abliterated Qwen3 text head to keep it local and uncensored. 
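Option 1 above can be sketched as a small driver that is independent of ComfyUI. This is a hedged illustration rather than the project's controller: `hill_climb`, `evaluate`, and `propose` are hypothetical names, and the defaults (target 0.85, patience 4, 25 iterations) are the example values from `docs/CALIBRATION_POLICY.md`. In practice `evaluate` would render the state to a prompt and run one `agent_bridge.py` iteration.

```python
def hill_climb(initial_state, evaluate, propose,
               target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb.

    evaluate(state) -> (overall_score, report)
    propose(best_state, report) -> copy of best_state with ONE axis changed
    """
    best_state, best_score = dict(initial_state), -1.0
    state, stale = dict(initial_state), 0
    for _ in range(max_iters):
        score, report = evaluate(state)
        if score > best_score:
            # Keep the edit: it raised the overall score.
            best_state, best_score, stale = dict(state), score, 0
        else:
            # The edit is discarded because the next one starts from best_state.
            stale += 1
        if best_score >= target or stale >= patience:
            break
        # Always edit from the best known state, never from a worse one.
        state = propose(best_state, report)
    return best_state, best_score
```

Because every proposal starts from `best_state`, a rejected edit never contaminates later iterations, which is the "revert on no improvement" rule in code form.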
+ +**Loop hygiene:** fix resolution/sampler/steps across iterations; freeze T2I seed while +searching; stop on `overall_score ≥ target` (e.g. 0.85) **or** `max_iters`; log every +`(knobs, score, diff)` triple so the search is auditable and resumable. + +--- + +## 5. Concrete build order + +1. **Judge node** (this repo, `nodes/qwen_judge.py`) — load local Qwen3‑VL‑4B abliterated, + take ref+gen, output `overall_score (FLOAT)`, `axis_scores (JSON STRING)`, + `diff_analysis (STRING)`, `raw (STRING)`. ✅ scaffolded. +2. **Wire the loop** in a workflow: Prompt‑Builder → T2I → Judge → Accumulator, using the + SxCP For‑Loop nodes; route `overall_score` into the loop's stop condition. +3. **Controller node** — start with greedy per‑axis hill‑climb that reads `diff_analysis` + and emits knob overrides back into Prompt‑Builder's split control nodes. +4. **Tune the judge** — calibrate the rubric on a handful of known ref/gen pairs; enable + `swap_eval`; pick temperature; decide if you need to step up to 8B/30B‑A3B. + +See [README.md](../README.md) for install/usage of the Judge node. diff --git a/nodes/__init__.py b/nodes/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/nodes/qwen_judge.py b/nodes/qwen_judge.py new file mode 100644 index 0000000..c3beb18 --- /dev/null +++ b/nodes/qwen_judge.py @@ -0,0 +1,418 @@ +""" +Qwen3-VL Image-Similarity Judge node for ComfyUI. + +The "vllm node" of the Prompt Calibrator. It takes a REFERENCE image and a +GENERATED image and asks a local Qwen3-VL model how close the generated image is +to the reference, returning a machine-readable score + per-axis difference +analysis that the calibration controller can act on. + +Reuses the standard transformers Qwen3-VL plumbing (the same approach used by +ComfyUI-QwenVL-MultiImage / ComfyUI_Qwen3-VL-Instruct), but forces strict JSON +output so the result is usable by an automated loop rather than a human reader. 
+ +Default model is the locally converted huihui-ai Qwen3-VL-4B-Instruct +*abliterated* (uncensored) weights, which do not refuse to analyze adult imagery. +""" + +from __future__ import annotations + +import json +import os +import re + +import numpy as np +import torch +from PIL import Image + +# Default to the model already converted on this machine (works out of the box). +DEFAULT_MODEL_PATH = "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16" +DEFAULT_MODEL_PATH_FP8 = "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_fp8" + +# Recommended abliterated upgrades for the RTX 5090 32 GB (latest Qwen VL family). +# Download with: hf download --local-dir , then point model_path at it. +RECOMMENDED_MODELS = { + # Best judge that fits 32 GB. MoE (3B active -> fast). Use precision="nf4" + # (~18 GB) on 32 GB, or the GGUF quants via a GGUF node. transformers class: + # Qwen3VLMoeForConditionalGeneration (auto-detected below). + "30b-a3b": "huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated", + # Easy middle ground: bf16 ~17 GB, no quantization hassle, drop-in here. + "8b": "huihui-ai/Huihui-Qwen3-VL-8B-Instruct-abliterated", + # Lightweight, already local. + "4b": "huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated", +} + +DEFAULT_AXES = "cast, clothing, pose, scene, composition, expression, color_light" + +# Cache loaded (model, processor) keyed by (path, precision) so the loop does not +# reload weights every iteration. 
+_MODEL_CACHE: dict[tuple[str, str], tuple] = {} + + +def _looks_like_repo_id(s: str) -> bool: + """'org/name' HF repo id, not an absolute/local filesystem path.""" + return ("/" in s) and (" " not in s) and (not os.path.isabs(s)) and (not s.startswith(".")) + + +def _download_target_dir(repo_id: str) -> str: + """Where to put downloaded weights — prefer ComfyUI's models/prompt_generator/.""" + name = repo_id.split("/")[-1] + try: + import folder_paths # available when running inside ComfyUI + base = os.path.join(folder_paths.models_dir, "prompt_generator") + except Exception: + base = os.path.join(os.path.dirname(os.path.dirname(__file__)), "models") + return os.path.join(base, name) + + +def _resolve_model_source(model_path: str, auto_download: bool) -> str: + """Turn model_path (local dir | short alias | HF repo id) into a local dir. + + Downloads from the Hub on first use if needed (and auto_download is on). + """ + # Short alias -> full repo id (e.g. "30b-a3b", "8b", "4b"). + if model_path in RECOMMENDED_MODELS: + model_path = RECOMMENDED_MODELS[model_path] + + if os.path.isdir(model_path): + return model_path + + if _looks_like_repo_id(model_path): + target = _download_target_dir(model_path) + # Already downloaded? (a config.json is enough to trust the local copy) + if os.path.isfile(os.path.join(target, "config.json")): + return target + if not auto_download: + raise FileNotFoundError( + f"[QwenVLImageJudge] '{model_path}' is not downloaded and auto_download is off. " + f"Enable auto_download or pre-fetch it to {target}.") + from huggingface_hub import snapshot_download + print(f"[QwenVLImageJudge] downloading {model_path} -> {target} (first run only, may be large)...") + local = snapshot_download( + repo_id=model_path, + local_dir=target, + # weights + processor/tokenizer/config; skip duplicate GGUF/onnx blobs. 
+ allow_patterns=["*.json", "*.safetensors", "*.txt", "*.model", "merges.txt", "*.py"], + ) + print(f"[QwenVLImageJudge] download complete: {local}") + return local + + # A local path that simply doesn't exist. + raise FileNotFoundError( + f"[QwenVLImageJudge] model_path not found: {model_path}. " + f"Use a local checkpoint dir, a HF repo id (org/name), or an alias " + f"({', '.join(RECOMMENDED_MODELS)}).") + + +def _tensor_to_pil(image: "torch.Tensor") -> Image.Image: + """ComfyUI IMAGE tensor (B,H,W,C float 0..1) -> first-frame PIL.Image (RGB).""" + if image is None: + raise ValueError("Judge node received an empty image input.") + arr = image + if hasattr(arr, "detach"): + arr = arr.detach().cpu().numpy() + arr = np.asarray(arr) + if arr.ndim == 4: # batch -> take first frame + arr = arr[0] + arr = np.clip(arr * 255.0, 0, 255).astype(np.uint8) + if arr.ndim == 2: + arr = np.stack([arr] * 3, axis=-1) + if arr.shape[-1] == 4: # drop alpha + arr = arr[..., :3] + return Image.fromarray(arr, mode="RGB") + + +def _resolve_vl_class(model_path: str): + """Pick the right transformers class. AutoModelForImageTextToText reads the + checkpoint's `architectures` and instantiates the correct dense + (Qwen3VLForConditionalGeneration) or MoE (Qwen3VLMoeForConditionalGeneration) + class automatically — so 4B/8B *and* 30B-A3B all work without branching.""" + try: + from transformers import AutoModelForImageTextToText as _Auto + return _Auto + except ImportError: # pragma: no cover - older transformers + name = model_path.lower() + is_moe = any(t in name for t in ("a3b", "moe", "30b", "235b")) + if is_moe: + from transformers import Qwen3VLMoeForConditionalGeneration as _C + else: + from transformers import Qwen3VLForConditionalGeneration as _C + return _C + + +def _load_model(model_path: str, precision: str): + key = (model_path, precision) + if key in _MODEL_CACHE: + return _MODEL_CACHE[key] + + # Imported lazily so the node can be registered even if transformers is old. 
+ from transformers import AutoProcessor + + _VLModel = _resolve_vl_class(model_path) + load_kwargs = dict(device_map="auto", trust_remote_code=True, low_cpu_mem_usage=True) + + if precision == "nf4": + # 4-bit (bitsandbytes) — lets the 30B-A3B abliterated MoE fit in ~18 GB on 32 GB. + from transformers import BitsAndBytesConfig + load_kwargs["quantization_config"] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + bnb_4bit_use_double_quant=True, + ) + elif precision == "fp8": + # Pre-quantized FP8 weights: let the checkpoint dictate dtype. + pass + else: + load_kwargs["dtype"] = torch.bfloat16 if precision == "bf16" else torch.float16 + + model = _VLModel.from_pretrained(model_path, **load_kwargs) + model.eval() + processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) + _MODEL_CACHE[key] = (model, processor) + return model, processor + + +def _build_system_prompt(axes: list[str]) -> str: + axis_lines = "\n".join(f' "{a}": {{"score": <0..1>, "diff": ""}},' for a in axes) + return ( + "You are a meticulous visual-similarity judge for an image-generation " + "calibration loop. You are shown two images: IMAGE 1 is the REFERENCE " + "(the target) and IMAGE 2 is the GENERATED candidate. Judge how closely " + "the GENERATED image reproduces the REFERENCE.\n\n" + "Score each axis from 0 to 1 using this anchored rubric:\n" + " 0.0 = unrelated; 0.5 = same general category but clearly different " + "details; 1.0 = near-identical.\n" + "For each axis, FIRST note the concrete difference, THEN assign the number.\n\n" + "Reply with STRICT JSON only, no prose, no markdown fences, exactly:\n" + "{\n" + ' "overall_score": <0..1>,\n' + ' "axes": {\n' + f"{axis_lines}\n" + " },\n" + ' "fix_suggestions": ["", ...]\n' + "}\n" + "Phrase every diff and fix in terms of the named axes " + "(cast/clothing/pose/scene/composition/expression/color_light). 
" + "overall_score must be consistent with the per-axis scores." + ) + + +def _run_once(model, processor, ref_pil, gen_pil, axes, max_new_tokens, temperature): + """One forward pass; returns the raw decoded string.""" + messages = [ + {"role": "system", "content": _build_system_prompt(axes)}, + { + "role": "user", + "content": [ + {"type": "text", "text": "IMAGE 1 = REFERENCE (target):"}, + {"type": "image", "image": ref_pil}, + {"type": "text", "text": "IMAGE 2 = GENERATED candidate:"}, + {"type": "image", "image": gen_pil}, + {"type": "text", "text": "Now return the strict JSON judgement."}, + ], + }, + ] + + text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + inputs = processor(text=[text], images=[ref_pil, gen_pil], return_tensors="pt") + inputs = inputs.to(model.device) + + gen_kwargs = dict(max_new_tokens=max_new_tokens) + if temperature and temperature > 0: + gen_kwargs.update(do_sample=True, temperature=float(temperature)) + else: + gen_kwargs.update(do_sample=False) + + with torch.inference_mode(): + out = model.generate(**inputs, **gen_kwargs) + trimmed = out[:, inputs.input_ids.shape[1]:] + decoded = processor.batch_decode(trimmed, skip_special_tokens=True)[0] + return decoded.strip() + + +def _parse_json(raw: str) -> dict | None: + """Best-effort: pull the first balanced JSON object out of the model output.""" + # Strip code fences if present. 
+ fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL) + candidate = fenced.group(1) if fenced else None + if candidate is None: + start = raw.find("{") + if start == -1: + return None + depth = 0 + for i in range(start, len(raw)): + if raw[i] == "{": + depth += 1 + elif raw[i] == "}": + depth -= 1 + if depth == 0: + candidate = raw[start:i + 1] + break + if candidate is None: + return None + try: + return json.loads(candidate) + except json.JSONDecodeError: + return None + + +def _merge_swapped(a: dict, b: dict) -> dict: + """Average two judgements (normal + order-swapped) to cut position bias.""" + if not b: + return a + if not a: + return b + out = {"axes": {}, "fix_suggestions": []} + out["overall_score"] = round( + (float(a.get("overall_score", 0)) + float(b.get("overall_score", 0))) / 2.0, 4 + ) + axes = set(a.get("axes", {})) | set(b.get("axes", {})) + for ax in axes: + sa = a.get("axes", {}).get(ax, {}) + sb = b.get("axes", {}).get(ax, {}) + score = (float(sa.get("score", 0)) + float(sb.get("score", 0))) / 2.0 + diff = sa.get("diff") or sb.get("diff") or "" + out["axes"][ax] = {"score": round(score, 4), "diff": diff} + out["fix_suggestions"] = (a.get("fix_suggestions") or []) + (b.get("fix_suggestions") or []) + return out + + +def _report_base_dir(report_dir: str) -> str: + if report_dir: + return report_dir + try: + import folder_paths + return os.path.join(folder_paths.get_output_directory(), "calibrator") + except Exception: + return os.path.join(os.path.dirname(os.path.dirname(__file__)), "output", "calibrator") + + +def _write_report(report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used): + """Persist the analysis so the external CLI agent can read it after a queue. + + Writes a per-run file plus a stable `latest.json` the agent can always poll. 
+ Returns the per-run file path (or "" on failure).""" + base = _report_base_dir(report_dir) + try: + os.makedirs(base, exist_ok=True) + except OSError as e: + print(f"[QwenVLImageJudge] could not create report dir {base}: {e}") + return "" + + payload = { + "run_tag": run_tag, + "overall_score": round(float(overall), 4), + "axes": (merged or {}).get("axes", {}), + "fix_suggestions": (merged or {}).get("fix_suggestions", []), + "diff_analysis": diff_analysis, + "prompt_used": prompt_used, + "raw": raw_all, + } + tag = re.sub(r"[^A-Za-z0-9._-]", "_", run_tag) if run_tag else "latest" + run_path = os.path.join(base, f"calib_{tag}.json") + for path in (run_path, os.path.join(base, "latest.json")): + try: + with open(path, "w", encoding="utf-8") as f: + json.dump(payload, f, ensure_ascii=False, indent=2) + except OSError as e: + print(f"[QwenVLImageJudge] failed writing report {path}: {e}") + # A markdown sibling is handy for the agent to read as plain text. + try: + md = (f"# Calibration analysis ({tag})\n\n" + f"**overall_score:** {payload['overall_score']}\n\n" + f"**prompt_used:**\n\n{prompt_used or '(not provided)'}\n\n" + f"## per-axis\n\n{diff_analysis}\n") + with open(os.path.join(base, f"calib_{tag}.md"), "w", encoding="utf-8") as f: + f.write(md) + except OSError: + pass + return run_path + + +class QwenVLImageJudge: + """ComfyUI node: score how close a generated image is to a reference.""" + + CATEGORY = "prompt_calibrator" + FUNCTION = "judge" + RETURN_TYPES = ("FLOAT", "STRING", "STRING", "STRING", "STRING") + RETURN_NAMES = ("overall_score", "axis_scores_json", "diff_analysis", "raw", "report_path") + + @classmethod + def INPUT_TYPES(cls): + return { + "required": { + "reference_image": ("IMAGE",), + "generated_image": ("IMAGE",), + "model_path": ("STRING", {"default": DEFAULT_MODEL_PATH}), + "precision": (["bf16", "fp16", "fp8", "nf4"], {"default": "bf16"}), + "axes": ("STRING", {"default": DEFAULT_AXES, "multiline": True}), + "max_new_tokens": ("INT", 
{"default": 512, "min": 64, "max": 4096}), + "temperature": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 1.5, "step": 0.05}), + "swap_eval": ("BOOLEAN", {"default": True}), + }, + "optional": { + "keep_loaded": ("BOOLEAN", {"default": True}), + "auto_download": ("BOOLEAN", {"default": True}), + # The agent reads the analysis from these files after each queue. + "report_dir": ("STRING", {"default": ""}), + "run_tag": ("STRING", {"default": ""}), + "prompt_used": ("STRING", {"default": "", "multiline": True}), + }, + } + + def judge(self, reference_image, generated_image, model_path, precision, axes, + max_new_tokens, temperature, swap_eval, keep_loaded=True, auto_download=True, + report_dir="", run_tag="", prompt_used=""): + axis_list = [a.strip() for a in re.split(r"[,\n]", axes) if a.strip()] + if not axis_list: + axis_list = [a.strip() for a in DEFAULT_AXES.split(",")] + + try: + resolved_path = _resolve_model_source(model_path, auto_download) + except Exception as e: # missing model / download failure -> surface as score 0 + msg = str(e) + print(msg) + # Must match the 5 RETURN_TYPES: score, axis JSON, diff, raw, report_path. + return (0.0, "{}", msg, msg, "") + + ref_pil = _tensor_to_pil(reference_image) + gen_pil = _tensor_to_pil(generated_image) + + model, processor = _load_model(resolved_path, precision) + + raw1 = _run_once(model, processor, ref_pil, gen_pil, axis_list, max_new_tokens, temperature) + parsed1 = _parse_json(raw1) or {} + + raw_all = raw1 + merged = parsed1 + if swap_eval: + # Swap which image is called REFERENCE to average out position bias. 
+ raw2 = _run_once(model, processor, gen_pil, ref_pil, axis_list, max_new_tokens, temperature) + parsed2 = _parse_json(raw2) or {} + merged = _merge_swapped(parsed1, parsed2) + raw_all = raw1 + "\n--- SWAPPED ---\n" + raw2 + + if not keep_loaded: + _MODEL_CACHE.pop((resolved_path, precision), None) + del model + torch.cuda.empty_cache() + + overall = float(merged.get("overall_score", 0.0)) if merged else 0.0 + axis_scores = json.dumps(merged.get("axes", {}), ensure_ascii=False, indent=2) if merged else "{}" + + # Human/controller-readable diff summary. + diff_lines = [] + for ax, info in (merged.get("axes", {}) if merged else {}).items(): + diff_lines.append(f"- {ax}: {info.get('score', 0):.2f} — {info.get('diff', '')}") + fixes = merged.get("fix_suggestions", []) if merged else [] + if fixes: + diff_lines.append("fixes: " + "; ".join(str(f) for f in fixes)) + diff_analysis = "\n".join(diff_lines) if diff_lines else "(no parseable judgement)" + + report_path = _write_report( + report_dir, run_tag, overall, merged, diff_analysis, raw_all, prompt_used) + + return (round(overall, 4), axis_scores, diff_analysis, raw_all, report_path) + + +NODE_CLASS_MAPPINGS = {"QwenVLImageJudge": QwenVLImageJudge} +NODE_DISPLAY_NAME_MAPPINGS = {"QwenVLImageJudge": "Qwen3-VL Image Judge (Calibrator)"} diff --git a/nodes/receptor.py b/nodes/receptor.py new file mode 100644 index 0000000..7f6cad4 --- /dev/null +++ b/nodes/receptor.py @@ -0,0 +1,66 @@ +""" +Calibrator Prompt Receptor node. + +The injection point for the external CLI-agent controller. The agent overrides +this node's widget values per queue via the ComfyUI HTTP API (`POST /prompt`, +override by node id), or — as a fallback — points `source_file` at a JSON file +the agent writes. Its outputs feed the T2I sampler in place of a static prompt. 
+ +This is the "receptor in ComfyUI" in the loop: + agent -> (sets prompt here) -> T2I -> Qwen3-VL Judge -> analysis -> agent +""" + +from __future__ import annotations + +import json +import os + + +class CalibratorPromptReceptor: + CATEGORY = "prompt_calibrator" + FUNCTION = "emit" + RETURN_TYPES = ("STRING", "STRING", "INT") + RETURN_NAMES = ("prompt", "negative", "seed") + + @classmethod + def INPUT_TYPES(cls): + return { + "required": { + "prompt": ("STRING", {"default": "", "multiline": True}), + "negative": ("STRING", {"default": "", "multiline": True}), + "seed": ("INT", {"default": 0, "min": 0, "max": 0x7FFFFFFFFFFFFFFF}), + }, + "optional": { + # If set and present, a JSON file {prompt, negative, seed} overrides + # the widgets above. Lets the agent drive the loop file-first if it + # prefers that to the HTTP API. + "source_file": ("STRING", {"default": ""}), + }, + } + + @classmethod + def IS_CHANGED(cls, prompt, negative, seed, source_file=""): + # Re-run whenever the effective inputs change: widget values (API override) + # OR the source file's mtime (file-driven mode). 
+ mtime = "" + if source_file and os.path.isfile(source_file): + mtime = str(os.path.getmtime(source_file)) + return f"{prompt}|{negative}|{seed}|{source_file}|{mtime}" + + def emit(self, prompt, negative, seed, source_file=""): + if source_file and os.path.isfile(source_file): + try: + with open(source_file, "r", encoding="utf-8") as f: + data = json.load(f) + prompt = data.get("prompt", prompt) + negative = data.get("negative", negative) + seed = int(data.get("seed", seed)) + except (OSError, ValueError, json.JSONDecodeError) as e: + print(f"[CalibratorPromptReceptor] could not read {source_file}: {e}") + return (prompt, negative, int(seed)) + + +NODE_CLASS_MAPPINGS = {"CalibratorPromptReceptor": CalibratorPromptReceptor} +NODE_DISPLAY_NAME_MAPPINGS = { + "CalibratorPromptReceptor": "SxCP External Prompt (Receptor)" +} diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000..a4a98fd --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,19 @@ +[project] +name = "comfyui-prompt-calibratror" +description = "VLM-as-judge prompt calibration loop: Qwen3-VL scores generated vs reference images to calibrate the prompt-generation method." +version = "0.1.0" +license = { text = "MIT" } +requires-python = ">=3.10" +dependencies = [ + "transformers>=4.57.0", + "pillow", + "numpy", +] + +[project.urls] +Repository = "https://github.com/ethanfel/ComfyUI-Prompt-Calibratror" + +[tool.comfy] +PublisherId = "ethanfel" +DisplayName = "ComfyUI-Prompt-Calibratror" +Icon = "" diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..03a96b8 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,10 @@ +# Qwen3-VL needs transformers >= 4.57 (the version the local checkpoint was saved with). 
+transformers>=4.57.0 +huggingface_hub # auto-download of models by repo id / alias +torch +pillow +numpy +# for precision=nf4 (4-bit) — needed to run the 30B-A3B abliterated judge on 32 GB: +bitsandbytes +# optional, for faster attention on the RTX 5090: +# flash-attn diff --git a/workflow/workflow_api.json b/workflow/workflow_api.json new file mode 100644 index 0000000..9e4c563 --- /dev/null +++ b/workflow/workflow_api.json @@ -0,0 +1,82 @@ +{ + "4": { + "class_type": "CheckpointLoaderSimple", + "inputs": { "ckpt_name": "waiIllustriousSDXL_v160.safetensors" }, + "_meta": { "title": "Load Checkpoint (swap for your T2I)" } + }, + "10": { + "class_type": "CalibratorPromptReceptor", + "inputs": { + "prompt": "a photo of a woman, casual outfit, indoors", + "negative": "blurry, deformed, lowres, extra limbs", + "seed": 12345, + "source_file": "" + }, + "_meta": { "title": "SxCP External Prompt (Receptor)" } + }, + "6": { + "class_type": "CLIPTextEncode", + "inputs": { "text": ["10", 0], "clip": ["4", 1] }, + "_meta": { "title": "Positive (from receptor)" } + }, + "7": { + "class_type": "CLIPTextEncode", + "inputs": { "text": ["10", 1], "clip": ["4", 1] }, + "_meta": { "title": "Negative (from receptor)" } + }, + "5": { + "class_type": "EmptyLatentImage", + "inputs": { "width": 1024, "height": 1024, "batch_size": 1 }, + "_meta": { "title": "Empty Latent" } + }, + "3": { + "class_type": "KSampler", + "inputs": { + "model": ["4", 0], + "positive": ["6", 0], + "negative": ["7", 0], + "latent_image": ["5", 0], + "seed": ["10", 2], + "steps": 28, + "cfg": 5.5, + "sampler_name": "euler", + "scheduler": "normal", + "denoise": 1.0 + }, + "_meta": { "title": "KSampler (seed from receptor)" } + }, + "8": { + "class_type": "VAEDecode", + "inputs": { "samples": ["3", 0], "vae": ["4", 2] }, + "_meta": { "title": "VAE Decode" } + }, + "9": { + "class_type": "SaveImage", + "inputs": { "images": ["8", 0], "filename_prefix": "calibrator/gen" }, + "_meta": { "title": "Save Generated" } + 
}, + "11": { + "class_type": "LoadImage", + "inputs": { "image": "reference.png" }, + "_meta": { "title": "Reference Image (put in ComfyUI/input/)" } + }, + "12": { + "class_type": "QwenVLImageJudge", + "inputs": { + "reference_image": ["11", 0], + "generated_image": ["8", 0], + "model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16", + "precision": "bf16", + "axes": "cast, clothing, pose, scene, composition, expression, color_light", + "max_new_tokens": 512, + "temperature": 0.0, + "swap_eval": true, + "keep_loaded": true, + "auto_download": true, + "report_dir": "/media/p5/Comfyui/output/calibrator", + "run_tag": "", + "prompt_used": "" + }, + "_meta": { "title": "Qwen3-VL Image Judge (Calibrator)" } + } +}