Initial commit: VLM-as-judge prompt calibration loop

Qwen3-VL image-similarity judge node, external-prompt receptor node, agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/calibration-policy docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:15:56 +02:00
commit 95198a15b5
13 changed files with 1294 additions and 0 deletions
@@ -0,0 +1,87 @@
+# Agent-driven calibration loop
+
+The controller is an **external CLI agent**, not an in-graph node. ComfyUI is the
+execution environment (prompt receptor → T2I → VLM judge); the agent is the brain that
+reads the analysis, calibrates the prompt generator, and queues the next iteration.
+
+```
+ CLI AGENT (controller / brain)                 COMFYUI (execution, running with --listen)
+ ───────────────────────────────                ──────────────────────────────────────────
+  1. build/calibrate a prompt
+  2. agent_bridge.py --prompt ... ───POST /prompt──►  CalibratorPromptReceptor (injection point)
+                                                          │ prompt / negative / seed
+                                                          ▼
+                                                       T2I (SDXL / Flux / Krea2)
+                                                          │ generated image
+                                                          ▼
+                                                       Qwen3-VL Image Judge
+                                                          │ writes calib_<run_tag>.json + latest.json
+  3. poll /history/{id} (bridge does this)  ◄───────────┘
+  4. read report JSON (overall_score,
+     per-axis diffs, fix_suggestions)
+  5. adjust Prompt-Builder knobs / prompt
+  └──► go to 1 until overall_score ≥ target
+```
+
+## Why API-driven, not file-watch
+
+A passive "watch a file and auto-run" receptor is fragile in ComfyUI: there is no native
+file watcher or auto-queue, and the prompt, image, and analysis can drift out of sync.
+Queueing each run through `POST /prompt` and polling `/history/{id}` instead makes every
+iteration **synchronous and ordered** from the agent's point of view: one `prompt_id` ties
+the prompt, the image, and the analysis together. The receptor node is still the clean
+injection point; the agent just overrides its widgets on each queued run. (The receptor
+*also* supports a `source_file` for file-first workflows, should you ever want one.)
+
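A minimal sketch of the queue-then-poll pattern described above, using only the standard library. It is not the code in `agent_bridge.py`: the names `http_json` and `queue_and_wait`, the `{"prompt": workflow}` request body, and the assumption that a finished run appears as a key in the `/history/{id}` response are illustrative and should be checked against your ComfyUI version.

```python
import json
import time
import urllib.request


def http_json(url, payload=None):
    """GET url, or POST payload as JSON when one is given; return the decoded body."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def queue_and_wait(server, workflow, fetch=http_json, poll_s=1.0, timeout_s=600.0):
    """Queue one API-format workflow and block until its history entry exists."""
    # Assumption: the queue response carries the id under "prompt_id".
    prompt_id = fetch(f"http://{server}/prompt", {"prompt": workflow})["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        history = fetch(f"http://{server}/history/{prompt_id}")
        if prompt_id in history:  # assumed to appear only once the run is done
            return prompt_id, history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Passing `fetch` as a parameter keeps the loop testable without a running server; in normal use the default `http_json` talks to ComfyUI directly.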
+## The three pieces
+
+| Piece | Role |
+|---|---|
+| `CalibratorPromptReceptor` (`SxCP External Prompt (Receptor)`) | Stable node the agent injects `prompt/negative/seed` into. Feeds the sampler. |
+| `QwenVLImageJudge` (`Qwen3-VL Image Judge (Calibrator)`) | Scores generated vs reference; writes `calib_<run_tag>.json`, `latest.json`, `calib_<run_tag>.md` to `report_dir`. |
+| `agent_bridge.py` | One CLI call = one iteration: inject prompt → queue → wait → print the analysis JSON to stdout. Stdlib only. |
+
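The "inject" step can be pictured as a small patch to the API-format workflow before it is queued. The sketch below assumes the receptor's `class_type` string is `CalibratorPromptReceptor` and that its inputs are literally named `prompt`, `negative`, and `seed`; both are guesses from the table above, and `inject_prompt` is a hypothetical helper, not a function from this repository.

```python
import copy


def inject_prompt(workflow, prompt, negative, seed,
                  class_type="CalibratorPromptReceptor"):
    """Return a copy of the workflow with the receptor's inputs overridden."""
    patched = copy.deepcopy(workflow)  # leave the exported workflow untouched
    receptors = [node for node in patched.values()
                 if isinstance(node, dict) and node.get("class_type") == class_type]
    if len(receptors) != 1:
        raise ValueError(f"expected exactly one {class_type}, found {len(receptors)}")
    # Assumed input names; adjust to match the node's real widget names.
    receptors[0].setdefault("inputs", {}).update(
        prompt=prompt, negative=negative, seed=seed)
    return patched
```

Requiring exactly one receptor makes a mis-exported workflow fail loudly instead of silently queueing an unpatched prompt.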
+## One iteration (what the agent runs)
+
+```bash
+python agent_bridge.py \
+  --server 127.0.0.1:8188 \
+  --workflow workflow_api.json \
+  --prompt   "1 woman, red lingerie, bedroom, full body, warm rim light" \
+  --negative "blurry, deformed" \
+  --seed 12345 \
+  --run-tag  iter003 \
+  --analysis-dir /media/p5/Comfyui/output/calibrator
+```
+
+Stdout (captured by the agent) is the report:
+
+```json
+{
+  "run_tag": "iter003",
+  "overall_score": 0.62,
+  "axes": {
+    "pose":     {"score": 0.40, "diff": "ref standing, gen seated"},
+    "clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"}
+  },
+  "fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"],
+  "prompt_used": "1 woman, red lingerie, ...",
+  "_prompt_id": "…", "_report_path": "…/calib_iter003.json"
+}
+```
+
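From the agent's side, one iteration is a subprocess call followed by a JSON parse. The sketch below uses only the flags shown in the command above; `bridge_command` and `run_iteration` are hypothetical helper names, the default `analysis_dir` is a placeholder, and it assumes stdout contains nothing but the report JSON.

```python
import json
import subprocess
import sys


def bridge_command(prompt, negative, seed, run_tag,
                   server="127.0.0.1:8188",
                   workflow="workflow_api.json",
                   analysis_dir="output/calibrator"):  # placeholder path
    """Build the argv for one agent_bridge.py iteration."""
    return [
        sys.executable, "agent_bridge.py",
        "--server", server,
        "--workflow", workflow,
        "--prompt", prompt,
        "--negative", negative,
        "--seed", str(seed),
        "--run-tag", run_tag,
        "--analysis-dir", analysis_dir,
    ]


def run_iteration(**kwargs):
    """Run the bridge once and return the report parsed from its stdout."""
    done = subprocess.run(bridge_command(**kwargs),
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)  # assumes stdout is pure JSON
```

Building the argv as a list (rather than a shell string) avoids quoting problems when prompts contain commas, quotes, or parentheses.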
+## Agent calibration policy (suggested)
+
+Each iteration, the agent maps the lowest-scoring axes onto Prompt-Builder knobs, applies
+the `fix_suggestions`, regenerates, and keeps only the changes that raise `overall_score`
+(a greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so
+that score changes reflect the prompt, not sampler noise; vary the seed only once you are
+near the target. Stop at `overall_score ≥ target` (e.g. 0.85) or when a max-iteration
+budget runs out. Log every `(prompt, knobs, score)` tuple so the search is auditable and
+resumable.
+
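One possible reading of this policy as code, offered as a sketch rather than the agent's actual logic. `evaluate` stands for "run one iteration with a fixed seed and return the report", and `propose` stands for "turn one weak axis plus the report into a new knob setting"; both are hypothetical callables the agent would supply, and the knob dictionary is an assumed representation.

```python
def calibrate(evaluate, propose, knobs, target=0.85, max_evals=10):
    """Greedy per-axis hill-climb: accept a change only if overall_score rises."""
    best = evaluate(knobs)
    log = [(dict(knobs), best["overall_score"])]  # audit trail of every attempt
    improved = True
    while improved and best["overall_score"] < target and len(log) < max_evals:
        improved = False
        axes = best["axes"]
        # Try the weakest axis first, then the next weakest, and so on.
        for axis in sorted(axes, key=lambda name: axes[name]["score"]):
            if len(log) >= max_evals:
                break
            candidate = propose(dict(knobs), axis, best)
            report = evaluate(candidate)
            log.append((dict(candidate), report["overall_score"]))
            if report["overall_score"] > best["overall_score"]:
                knobs, best, improved = candidate, report, True
                break  # re-rank the axes against the new best report
    return knobs, best, log
```

The loop stops when the target is reached, the evaluation budget is spent, or no single-axis change improves the score, which is the usual local-optimum exit for a greedy search.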
+## Setup checklist
+
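+1. Install this node pack, then run ComfyUI with `--listen` so the bridge can POST to it (required when the agent runs on another host; ComfyUI binds to localhost only by default).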
+2. Build a workflow with: `CalibratorPromptReceptor` → (Prompt-Builder formatting, optional) → T2I → `QwenVLImageJudge` (feed the **reference** image into `reference_image`, the T2I output into `generated_image`).
+3. Set the Judge's `report_dir` to a known path; pass the same path as `--analysis-dir`.
+4. Export the workflow in **API format** (`workflow_api.json`).
+5. Drive it from the agent with `agent_bridge.py`, once per iteration.
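Steps 2 to 4 of the checklist can be sanity-checked before the first run. The sketch below is a hypothetical pre-flight helper, not part of the node pack: it assumes the `class_type` strings match the node names in the table above, that the Judge input is named `report_dir` as in step 3, and that a UI-format export can be recognised by a top-level `nodes` key.

```python
def check_workflow(workflow, analysis_dir):
    """Return a list of problems; an empty list means the checks passed."""
    # UI-format exports keep a top-level "nodes" list; API format maps
    # node ids directly to {"class_type": ..., "inputs": ...}.
    if not isinstance(workflow, dict) or "nodes" in workflow:
        return ["not API format: re-export the workflow in API format"]
    nodes = [n for n in workflow.values() if isinstance(n, dict)]
    classes = [n.get("class_type") for n in nodes]
    problems = []
    for required in ("CalibratorPromptReceptor", "QwenVLImageJudge"):
        if required not in classes:
            problems.append(f"missing node: {required}")
    for node in nodes:
        if node.get("class_type") == "QwenVLImageJudge":
            report_dir = node.get("inputs", {}).get("report_dir")
            if report_dir != analysis_dir:
                problems.append(
                    f"report_dir {report_dir!r} != --analysis-dir {analysis_dir!r}")
    return problems
```

Running this on the parsed `workflow_api.json` before queueing catches the two most likely setup slips: exporting the wrong format and pointing the bridge at a different report directory than the Judge writes to.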