Initial commit: VLM-as-judge prompt calibration loop
Qwen3-VL image-similarity judge node, external-prompt receptor node, agent_bridge CLI, example SDXL workflow, and methodology/agent-loop/ calibration-policy docs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,87 @@
# Agent-driven calibration loop

The controller is an **external CLI agent**, not an in-graph node. ComfyUI is the
execution environment (prompt receptor → T2I → VLM judge); the agent is the brain that
reads the analysis, calibrates the prompt generator, and queues the next iteration.

```
CLI AGENT (controller / brain)                       COMFYUI (execution, running with --listen)
───────────────────────────────                      ──────────────────────────────────────────
1. build/calibrate a prompt
2. agent_bridge.py --prompt ...  ───POST /prompt──►  CalibratorPromptReceptor (injection point)
                                                        │ prompt / negative / seed
                                                        ▼
                                                     T2I (SDXL / Flux / Krea2)
                                                        │ generated image
                                                        ▼
                                                     Qwen3-VL Image Judge
                                                        │ writes calib_<tag>.json + latest.json
3. poll /history/{id} (bridge does this) ◄──────────────┘
4. read report JSON (overall_score,
   per-axis diffs, fix_suggestions)
5. adjust Prompt-Builder knobs / prompt
   └──► go to 1 until overall_score ≥ target
```

## Why API-driven, not file-watch

A passive "watch a file and auto-run" receptor is fragile in ComfyUI (no native file
watcher / auto-queue, and prompt↔image↔analysis can desync). Driving `POST /prompt`
instead makes every iteration **synchronous and ordered**: one `prompt_id` ties the
prompt, the image, and the analysis together. The receptor node is still the clean
injection point; the agent just overrides its widgets per queue. (The receptor *also*
supports a `source_file` for file-first workflows if you ever want it.)
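
The queue-then-poll pattern described above can be sketched with the standard library. This is a minimal illustration, not the contents of `agent_bridge.py`: the helper names (`build_queue_payload`, `http_json`, `run_once`), the `client_id` value, and the timeouts are assumptions. It relies only on ComfyUI's `POST /prompt` returning a `prompt_id` and on `GET /history/{prompt_id}` containing that id once execution has finished.

```python
import json
import time
import urllib.request
from typing import Optional


def build_queue_payload(workflow: dict, client_id: str = "calibrator-agent") -> bytes:
    """Encode an API-format workflow as the JSON body for POST /prompt."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")


def http_json(url: str, body: Optional[bytes] = None) -> dict:
    """GET (body=None) or POST a JSON body, then decode the JSON response."""
    headers = {"Content-Type": "application/json"} if body is not None else {}
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_once(server, workflow, fetch=http_json, poll_s=1.0, timeout_s=600.0):
    """Queue one workflow, then poll /history/{prompt_id} until the entry appears."""
    queued = fetch(f"http://{server}/prompt", build_queue_payload(workflow))
    prompt_id = queued["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        history = fetch(f"http://{server}/history/{prompt_id}")
        if prompt_id in history:  # the entry shows up once execution has finished
            return prompt_id, history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Because the same `prompt_id` is used for queueing and for the history lookup, a report read after `run_once` returns belongs to exactly that prompt. The `fetch` parameter exists only so the loop can be exercised without a running server.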

## The three pieces

| Piece | Role |
|---|---|
| `CalibratorPromptReceptor` (`SxCP External Prompt (Receptor)`) | Stable node the agent injects `prompt/negative/seed` into. Feeds the sampler. |
| `QwenVLImageJudge` (`Qwen3-VL Image Judge (Calibrator)`) | Scores generated vs reference; writes `calib_<run_tag>.json`, `latest.json`, `calib_<run_tag>.md` to `report_dir`. |
| `agent_bridge.py` | One CLI call = one iteration: inject prompt → queue → wait → print the analysis JSON to stdout. Stdlib only. |
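
As a rough sketch of what "inject" means for an API-format workflow (a dict of `node_id → {"class_type", "inputs"}`), the receptor's inputs can be overridden on a copy before queueing. The function name `inject_prompt` is hypothetical, and the input names `prompt`, `negative`, and `seed` are assumed from the table above rather than taken from the node's source.

```python
import copy


def inject_prompt(workflow, prompt, negative, seed,
                  receptor_class="CalibratorPromptReceptor"):
    """Return a copy of an API-format workflow with the receptor inputs overridden."""
    patched = copy.deepcopy(workflow)
    hits = [node for node in patched.values()
            if node.get("class_type") == receptor_class]
    if len(hits) != 1:
        raise ValueError(
            f"expected exactly one {receptor_class} node, found {len(hits)}")
    # Assumed input names, based on the "prompt/negative/seed" description.
    hits[0]["inputs"].update({"prompt": prompt, "negative": negative, "seed": seed})
    return patched
```

Working on a deep copy keeps the exported `workflow_api.json` contents untouched, so every iteration starts from the same baseline graph.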

## One iteration (what the agent runs)

```bash
python agent_bridge.py \
  --server 127.0.0.1:8188 \
  --workflow workflow_api.json \
  --prompt "1 woman, red lingerie, bedroom, full body, warm rim light" \
  --negative "blurry, deformed" \
  --seed 12345 \
  --run-tag iter003 \
  --analysis-dir /media/p5/Comfyui/output/calibrator
```

Stdout (captured by the agent) is the report:

```json
{
  "run_tag": "iter003",
  "overall_score": 0.62,
  "axes": {
    "pose": {"score": 0.40, "diff": "ref standing, gen seated"},
    "clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"}
  },
  "fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"],
  "prompt_used": "1 woman, red lingerie, ...",
  "_prompt_id": "…", "_report_path": "…/calib_iter003.json"
}
```
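
On the agent side, one iteration amounts to running the bridge, parsing its stdout, and locating the weakest axis. The sketch below assumes only the flags and report fields shown above; `bridge_argv`, `run_iteration`, and `worst_axis` are illustrative names, not part of the repo.

```python
import json
import subprocess
import sys


def bridge_argv(prompt, negative, seed, run_tag, workflow, analysis_dir,
                server="127.0.0.1:8188"):
    """Build the argv for one agent_bridge.py iteration (flags as shown above)."""
    return [sys.executable, "agent_bridge.py",
            "--server", server, "--workflow", workflow,
            "--prompt", prompt, "--negative", negative,
            "--seed", str(seed), "--run-tag", run_tag,
            "--analysis-dir", analysis_dir]


def run_iteration(**kwargs):
    """Run the bridge and parse the JSON report it prints to stdout."""
    done = subprocess.run(bridge_argv(**kwargs), check=True,
                          capture_output=True, text=True)
    return json.loads(done.stdout)


def worst_axis(report):
    """Return (axis_name, score) for the lowest-scoring axis in a report."""
    axes = report["axes"]
    name = min(axes, key=lambda axis: axes[axis]["score"])
    return name, axes[name]["score"]
```

For the sample report above, `worst_axis` picks `pose` (0.40), which is the axis the next edit should target.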

## Agent calibration policy (suggested)

The agent maps the lowest-scoring axes onto Prompt-Builder knobs, applies the matching
`fix_suggestions`, regenerates, and keeps only the changes that raise `overall_score`
(greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so
the score reflects the prompt, not sampler noise; vary the seed only once you're near the
target. Stop at `overall_score ≥ target` (e.g. 0.85) or when the max-iteration budget runs
out. Log every `(prompt, knobs, score)` so the search is auditable and resumable.

## Setup checklist

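1. Start ComfyUI so the bridge can reach its HTTP API, and install this node pack. The API binds to `127.0.0.1:8188` by default; `--listen` is only needed when the bridge runs on another machine.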
2. Build a workflow with: `CalibratorPromptReceptor` → (Prompt-Builder formatting, optional) → T2I → `QwenVLImageJudge` (feed the **reference** image into `reference_image`, the T2I output into `generated_image`).
3. Set the Judge's `report_dir` to a known path; pass the same path as `--analysis-dir`.
4. Export the workflow in **API format** (`workflow_api.json`).
5. Drive it from the agent with `agent_bridge.py`, once per iteration.
@@ -0,0 +1,135 @@
# Calibration policy: the agent's playbook

This is the instruction set the **external CLI agent** (the controller) follows each
iteration. Paste the "Agent system prompt" block into your agent, give it the workflow
path + reference image + target score, and let it loop.

The agent calibrates by reasoning over the **Prompt‑Builder axes** and editing a
structured *axis state*, then **rendering that state to a prompt string** that it injects
into the `CalibratorPromptReceptor`. This keeps the reasoning axis‑aware while staying
compatible with the flat‑string receptor. (If you later switch the receptor to carry a
structured config, the same axis state maps straight onto Prompt‑Builder's split control
nodes.)

---

## Axis state (the agent's working memory)

```json
{
  "cast": "1 woman, mid-20s, athletic",
  "clothing": "red lace lingerie",
  "pose": "standing, hand on hip",
  "scene": "dimly lit bedroom",
  "composition": "full-body shot, slight low angle",
  "expression": "soft smile, eye contact",
  "color_light": "warm rim light, shallow depth of field",
  "quality": "photorealistic, high detail",
  "negative": "blurry, deformed, lowres, extra limbs",
  "seed": 12345
}
```

The first seven keys are exactly the Judge's scoring axes; `quality`, `negative`, and
`seed` are carried in the state but not scored. The prompt is rendered in this order
(subject → wardrobe → action → setting → framing → affect → light → quality), while
`negative` and `seed` are passed to the bridge separately:

```
prompt = join_nonempty([cast, clothing, pose, scene, composition, expression, color_light, quality])
```
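
A minimal Python version of that render step might look like the following. The `", "` separator and the stripping of whitespace-only values are assumptions; the pseudocode above only fixes the order and the "skip empty" behaviour.

```python
RENDER_ORDER = ["cast", "clothing", "pose", "scene", "composition",
                "expression", "color_light", "quality"]


def render(state: dict) -> str:
    """Join the non-empty axis values, in render order, into one prompt string."""
    parts = (str(state.get(key, "")).strip() for key in RENDER_ORDER)
    return ", ".join(part for part in parts if part)


state = {"cast": "1 woman", "pose": "standing, hand on hip",
         "scene": "dimly lit bedroom", "negative": "blurry", "seed": 12345}
print(render(state))  # → 1 woman, standing, hand on hip, dimly lit bedroom
```

`negative` and `seed` never reach the rendered string because they are not in `RENDER_ORDER`.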

---

## Per‑iteration algorithm (greedy per‑axis hill‑climb)

```
state = initial_state                        # the state that gets rendered first
best_score = -1 ; best_state = state ; stale = 0 ; i = 0
loop:
    i += 1
    prompt = render(state)
    report = run agent_bridge.py --prompt prompt --negative state.negative
                                 --seed state.seed --run-tag iter{i}
                                 --workflow wf.json --analysis-dir <report_dir>
    score  = report.overall_score
    if score >= TARGET:                      # e.g. 0.85
        stop("converged", state, score)
    if score > best_score:
        best_score = score ; best_state = state ; stale = 0
    else:
        stale += 1
        state = best_state                   # revert: undo the change that didn't help
    if stale >= PATIENCE or i >= MAX_ITERS:  # e.g. PATIENCE=4, MAX_ITERS=25
        stop("plateau/budget", best_state, best_score)

    # choose the next single edit:
    worst_axis = axis with lowest per-axis score in report.axes
    edit  = map_fix_to_axis(report.fix_suggestions, worst_axis)   # apply the model's suggestion
    state = apply(best_state, worst_axis, edit)                   # change ONE axis only
```
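
The same loop as runnable Python, kept independent of ComfyUI by passing the judge call and the edit policy in as functions. This is a sketch: `evaluate` and `propose_edit` are hypothetical callables (the first would wrap `agent_bridge.py`, the second would implement the "lowest axis + matching fix" choice), and the status strings are illustrative.

```python
def hill_climb(initial_state, evaluate, propose_edit,
               target=0.85, patience=4, max_iters=25):
    """Greedy per-axis hill-climb; returns (status, state, score)."""
    state = dict(initial_state)
    best_state, best_score, stale = dict(state), -1.0, 0
    for i in range(1, max_iters + 1):
        report = evaluate(state, run_tag=f"iter{i}")
        score = report["overall_score"]
        if score >= target:
            return "converged", state, score
        if score > best_score:
            best_state, best_score, stale = dict(state), score, 0
        else:
            stale += 1          # the last edit did not help; it is dropped below
        if stale >= patience:
            return "plateau", best_state, best_score
        # Always edit from the best state, changing exactly one axis.
        axis, value = propose_edit(best_state, report)
        state = {**best_state, axis: value}
    return "budget", best_state, best_score
```

Building the next candidate from `best_state` is what implements the revert: a worse state is never used as the base for the following edit.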

### Rules that matter

1. **Change one axis per iteration.** One edit = clean attribution of the score delta.
   Only batch two edits when two axes score very low *and* are clearly independent.
2. **Freeze `seed` while searching axes.** The score must reflect the *prompt*, not
   sampler noise. Vary the seed only after you've converged, to confirm robustness.
3. **Always edit from `best_state`, not the last (possibly worse) state.** That is the
   "revert on no improvement" step; it prevents drifting down a bad path.
4. **Target the lowest‑scoring axis first**, applying the Judge's matching
   `fix_suggestion`. If a suggestion doesn't help after a try, pick an alternative value
   for that axis before moving on.
5. **Near the margin, don't over‑trust one reading.** `swap_eval` already averages two
   orderings; if two candidates are within ~0.03, re‑run each on a second seed and compare
   averages before committing.
6. **Detect gaming/oscillation.** If scores bounce without net gain, reduce edit size
   (smaller, more specific wording changes) and re‑anchor on `best_state`.
7. **Log every step**: `(iter, axis_changed, old→new value, prompt, overall_score, per‑axis)`.
   The run must be auditable and resumable.
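
Rule 5 can be reduced to two small checks. The sketch below assumes each candidate's scores are collected as a list with one entry per seed; the function names are illustrative, and ties are resolved in favour of the first candidate purely by convention.

```python
from statistics import mean


def needs_second_seed(score_a, score_b, margin=0.03):
    """True when two single-seed readings are too close to call."""
    return abs(score_a - score_b) <= margin


def better_by_mean(scores_a, scores_b):
    """Compare two candidates by their mean score over the seeds each was run on."""
    return "a" if mean(scores_a) >= mean(scores_b) else "b"
```

For example, readings of 0.71 and 0.69 fall inside the 0.03 margin, so both candidates would be re‑run on a second seed before one of them is committed as the new best state.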

### Mapping `fix_suggestions` → axes

The Judge phrases fixes in axis vocabulary ("set pose=standing", "add lace trim to
clothing", "warmer lighting"). Match by keyword to the axis key; if a fix is ambiguous,
attribute it to the lowest‑scoring axis it plausibly affects.
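
A keyword matcher along those lines could look like this. The keyword lists are invented for illustration (a real table would be built from the Judge's actual vocabulary), and plain substring matching is used for brevity, which can over‑match on short words.

```python
# Hypothetical keyword table; extend it with the vocabulary your Judge really uses.
AXIS_KEYWORDS = {
    "cast": ("cast", "woman", "women", "person", "people"),
    "clothing": ("clothing", "outfit", "lingerie", "dress", "lace"),
    "pose": ("pose", "standing", "seated", "sitting"),
    "scene": ("scene", "bedroom", "kitchen", "outdoor", "indoors"),
    "composition": ("composition", "framing", "full body", "close-up", "angle"),
    "expression": ("expression", "smile", "smiling", "neutral"),
    "color_light": ("color", "light", "warm", "cool"),
}


def map_fix_to_axis(suggestion, axis_scores):
    """Return the axis a suggestion targets, or None if no keyword matches.

    When several axes match, the lowest-scoring one wins (the ambiguity rule).
    """
    text = suggestion.lower()
    hits = [axis for axis, words in AXIS_KEYWORDS.items()
            if any(word in text for word in words)]
    if not hits:
        return None
    return min(hits, key=lambda axis: axis_scores.get(axis, 1.0))
```

Suggestions that match nothing return `None` so the agent can fall back to its own judgment instead of silently editing the wrong axis.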

---

## Worked example

```
iter1  prompt="1 woman, casual outfit, indoors, ..."    score=0.41
       axes: scene 0.30 (worst) - "ref bedroom, gen kitchen"
       fix:  "set scene to a dim bedroom"
iter2  edit scene→"dimly lit bedroom"                   score=0.58  (kept)
       axes: pose 0.35 (worst) - "ref standing, gen seated"
iter3  edit pose→"standing, hand on hip"                score=0.71  (kept)
       axes: color_light 0.50 (worst) - "ref warm, gen flat"
iter4  edit color_light→"warm rim light"                score=0.69  (worse → revert)
iter5  edit color_light→"warm golden hour glow"         score=0.83  (kept)
       axes: clothing 0.78 (worst) - "gen lacks lace detail"
iter6  edit clothing→"red lace lingerie with trim"      score=0.88  ≥ target → STOP
```

---

## Agent system prompt (paste into your CLI agent)

> You are the controller for a local image prompt calibrator. Goal: make a generated
> image match a reference image, measured by a Qwen3‑VL judge that scores 7 axes
> (cast, clothing, pose, scene, composition, expression, color_light) from 0–1.
>
> You hold an **axis state** (JSON, keys above). Each turn you: (1) render the state to a
> prompt string in the order cast→clothing→pose→scene→composition→expression→color_light→
> quality; (2) run `python agent_bridge.py --workflow <wf> --prompt "<rendered>"
> --negative "<state.negative>" --seed <state.seed> --run-tag iter<N> --analysis-dir
> <report_dir>`; (3) read the printed JSON report.
>
> Then apply greedy per‑axis hill‑climb: keep the change only if `overall_score` improved,
> else revert to the best state; pick the **lowest‑scoring axis** and apply the Judge's
> matching `fix_suggestion` as a **single** edit. Keep the seed fixed while searching.
> Stop when `overall_score ≥ TARGET` (default 0.85), or after PATIENCE=4 non‑improving
> iterations, or MAX_ITERS=25. Log every step as a table and report the best prompt + score.
>
> Never change more than one axis at a time unless two axes are both very low and clearly
> independent. Never trust a single near‑margin reading; re‑run on a second seed when two
> candidates are within 0.03.
@@ -0,0 +1,198 @@
# Local Prompt Calibrator: Methodology

> Goal: a **fully local** ComfyUI feedback loop where a vision‑language model (VLM)
> scores how close a *generated* image is to a *reference* image, and that score +
> a structured difference analysis is used to **calibrate the prompt‑generation
> method** ([ComfyUI‑Prompt‑Builder](../../ComfyUI-Prompt-Builder), the "SxCP" nodes)
> until the generated image matches the reference.

---

## 1. The loop at a glance

```
     ┌──────────────────────────────────────────────┐
     │ REFERENCE image (the target look)            │
     └───────────────┬──────────────────────────────┘
                     │
┌────────────────────▼────────────────┐   calibration deltas
│ Prompt-Builder (SxCP) ── "method"   │◄──── (axis nudges / knob
│ seeded pools + profile knobs        │       overrides / seed move)
└────────────────────┬────────────────┘
                     │ prompt + negative
┌────────────────────▼────────────────┐
│ T2I model (SDXL / Flux / Krea2)     │  ← fix the sampler seed while
└────────────────────┬────────────────┘    searching the prompt axes
                     │ generated image
┌────────────────────▼──────────────────────────────────┐
│ Qwen3-VL JUDGE node ── the "VLM node"                 │
│ in : reference + generated                            │
│ out: overall_score 0..1                               │
│      per-axis scores (cast, clothing, pose, scene,    │
│       composition, expression, color/lighting)        │
│      diff_analysis (JSON: what's off + how to fix,    │
│       phrased in Prompt-Builder axis vocabulary)      │
└────────────────────┬──────────────────────────────────┘
                     │ score + diffs
┌────────────────────▼─────────────────┐
│ CALIBRATOR / controller              │
│ - accumulate per-axis scores         │
│ - map diffs → axis adjustments       │
│ - update Prompt-Builder knobs        │
│ - stop when overall_score ≥ target   │
│   or max iterations reached          │
└──────────────────────────────────────┘
```

The novel piece is the **Judge node**. Off‑the‑shelf Qwen‑VL nodes emit free text;
a calibrator needs a **machine‑readable score + per‑axis diffs** so the controller
can act on them. That is what `nodes/qwen_judge.py` in this repo provides.

---

## 2. The VLM node: what to reuse

You already have the model converted locally:

```
/media/p5/qwen3vl_4b_abliterated_comfy_convert/
├── hf_bf16/   ← huihui-ai Qwen3-VL-4B-Instruct **abliterated** (uncensored), bf16
└── hf_fp8/    ← same model, FP8 (≈4–5 GB, trivially fits the RTX 5090 32 GB)
```

The **abliterated** variant matters: stock Qwen3‑VL will often refuse to "describe or
analyze" adult imagery, which would break the loop. huihui‑ai removed the text‑side
refusal direction, so it scores NSFW reference/generated pairs without bailing.

### Reusable ComfyUI nodes (pick one as the plumbing base)

| Repo | Backend | Multi‑image | Local path | Notes |
|---|---|---|---|---|
| **[hardik-uppal/ComfyUI-QwenVL-MultiImage](https://github.com/hardik-uppal/ComfyUI-QwenVL-MultiImage)** | transformers | ✅ `images` + `images_batch_2/3` | needs tiny tweak | **Best base**: built for "compare these images, describe the differences"; supports FP16 / 8‑bit / 4‑bit **and pre‑quantized FP8** (matches your `hf_fp8`). |
| [IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct](https://github.com/IuvenisSapiens/ComfyUI_Qwen3-VL-Instruct) | transformers | ✅ multi‑image query | HF download | Clean native Qwen3‑VL‑Instruct integration. |
| [jren712/ComfyUI-QwenVL-abliterated](https://github.com/jren712/ComfyUI-QwenVL-abliterated) | transformers | ✅ | abliterated‑oriented | Fork tuned for the abliterated weights. |
| [1038lab/ComfyUI-QwenVL](https://github.com/1038lab/ComfyUI-QwenVL) | **GGUF** (llama.cpp) | ✅ | local GGUF | Use only if you want GGUF; bf16 4B on 32 GB doesn't need it. |

**Recommendation:** don't run any of them *as‑is* for the loop, because they only output text.
Instead reuse their **model‑load + `apply_chat_template` + `generate`** plumbing inside
a purpose‑built **Judge node** (this repo) that forces structured JSON output. The
`ComfyUI-QwenVL-MultiImage` loader is the closest template (it already handles two
image batches + FP8).

### Model sizing on 32 GB (RTX 5090): abliterated, latest Qwen VL

As of June 2026, the latest Qwen VL family with abliterated *vision* builds is
**Qwen3‑VL**. Qwen3.5‑VL shipped in early 2026, but its abliterated builds are
**text‑only so far**, so there is no uncensored Qwen3.5‑*VL* yet. "Latest + uncensored +
fits 32 GB" therefore means **Qwen3‑VL‑30B‑A3B abliterated**.
All rows below are huihui‑ai abliterated (uncensored) weights:

| Model (abliterated) | Best precision on 32 GB | ~VRAM | Verdict |
|---|---|---|---|
| **Qwen3‑VL‑30B‑A3B‑Instruct** ([HF](https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated)) | **nf4 (4‑bit)** or GGUF Q4_K_M | ~18 GB | **Best judge that fits.** MoE → only 3B active, so it's fast despite 30B total. transformers class `Qwen3VLMoeForConditionalGeneration` (auto‑detected by the node). |
| Qwen3‑VL‑8B‑Instruct ([HF](https://huggingface.co/huihui-ai)) | bf16 | ~17 GB | Easy middle ground, no quantization. Clearly better than 4B; drop‑in for the judge node. |
| Qwen3‑VL‑4B‑Instruct (already local) | fp8 / bf16 | ~5 / ~9 GB | Lightweight fallback / fast iteration. |
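
The VRAM column can be sanity‑checked with weights‑only arithmetic: parameters × bits per parameter ÷ 8. The sketch below uses the nominal parameter counts from the model names (an approximation) and ignores everything except the weights, so the table's figures sit a little higher to leave room for the vision tower, activations, KV cache, and quantization metadata.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); runtime overhead excluded."""
    return params_billion * bits_per_param / 8


for name, params, bits in [("30B-A3B @ nf4", 30, 4),
                           ("8B @ bf16", 8, 16),
                           ("4B @ fp8", 4, 8),
                           ("4B @ bf16", 4, 16)]:
    print(f"{name:14s} ≈ {weight_gb(params, bits):.0f} GB of weights")
```

This gives roughly 15, 16, 4, and 8 GB respectively, which lines up with the ~18, ~17, ~5, and ~9 GB totals once overhead is added.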

**Gemma alternative:** Gemma‑3‑27B‑it (abliterated, 4‑bit ~16 GB) is a solid different
visual prior if you want a second opinion, but the Krea2 text encoder + Prompt‑Builder
are already Qwen‑aligned, so staying on Qwen3‑VL keeps the vocabulary consistent.

Download an upgrade and point the node's `model_path` at it:

```bash
hf download huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated \
  --local-dir /media/p5/models/Qwen3-VL-30B-A3B-abliterated
# then in the Judge node: model_path=<that dir>, precision=nf4
```

Practical note: at nf4 the 30B judge (~18 GB) and an SDXL/Flux T2I model can't always
co‑reside, so run them as **separate queue steps** and let ComfyUI unload between them; the
loop is sequential anyway. The 8B bf16 judge co‑resides more easily.

---

## 3. Scoring rubric (what the VLM actually returns)

The judge prompts Qwen3‑VL to return **strict JSON** with one overall score and a score
per axis, where the axes mirror what Prompt‑Builder can control. This is what makes the
diff *actionable* instead of generic prose.

```json
{
  "overall_score": 0.0,
  "axes": {
    "cast":        {"score": 0.0, "diff": "ref has 1 woman, gen has 2"},
    "clothing":    {"score": 0.0, "diff": "ref lingerie vs gen nude"},
    "pose":        {"score": 0.0, "diff": "ref standing vs gen seated"},
    "scene":       {"score": 0.0, "diff": "ref bedroom vs gen outdoor"},
    "composition": {"score": 0.0, "diff": "ref full body vs gen close-up"},
    "expression":  {"score": 0.0, "diff": "ref smiling vs gen neutral"},
    "color_light": {"score": 0.0, "diff": "ref warm vs gen cool/flat"}
  },
  "fix_suggestions": ["reduce cast to 1 woman", "set clothing=lingerie", ...]
}
```

The axis list is **configurable** on the node so it can match whichever Prompt‑Builder
knobs you expose (cast, clothing, pose, scene/location, composition/framing, expression,
color/lighting). `fix_suggestions` is phrased in axis vocabulary so the controller can
map each one onto a knob.
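
Since the whole loop depends on that JSON being well‑formed, a controller may want to check a report before acting on it. The validator below is an illustrative sketch based only on the schema shown above; `validate_report` and the default axis tuple are not part of the Judge node.

```python
EXPECTED_AXES = ("cast", "clothing", "pose", "scene",
                 "composition", "expression", "color_light")


def _is_unit_score(value) -> bool:
    """True for a number in the closed interval [0, 1]."""
    return isinstance(value, (int, float)) and 0.0 <= value <= 1.0


def validate_report(report: dict, axes=EXPECTED_AXES) -> list:
    """Return a list of problems; an empty list means the report is usable."""
    problems = []
    if not _is_unit_score(report.get("overall_score")):
        problems.append("overall_score missing or outside 0..1")
    for axis in axes:
        entry = report.get("axes", {}).get(axis)
        if not isinstance(entry, dict):
            problems.append(f"axis '{axis}' missing")
        elif not _is_unit_score(entry.get("score")):
            problems.append(f"axis '{axis}' score missing or outside 0..1")
    if not isinstance(report.get("fix_suggestions"), list):
        problems.append("fix_suggestions is not a list")
    return problems
```

Pass a custom `axes` tuple when the node is configured with a different axis list; a non‑empty result is a reasonable trigger for re‑running the judge rather than editing the prompt.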

### Reducing VLM‑as‑judge variance (important)

VLM scoring is noisy and biased. Mitigations baked into the node / recommended:

1. **Position‑bias swap**: run the judge twice with reference/generated order swapped and
   average the per‑axis scores (`swap_eval=True`). Cuts the "first image wins" bias.
2. **Low temperature** (0.0–0.3) + a **fixed rubric** in the system prompt → repeatable scores.
3. **Anchored 0–1 rubric** (0 = unrelated, 0.5 = same category/different details, 1 = near‑identical) so scores are comparable across iterations.
4. **Evidence‑first**: ask the model to state the concrete difference *before* the number; reasoning‑then‑score is generally more reliable than score‑then‑reasoning.
5. **Average over k T2I seeds** for the *same* prompt if you want the score to reflect the prompt rather than sampler noise, or, cheaper, **freeze the T2I seed** during the axis search and only vary it once at the end.
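
The position‑bias swap in item 1 boils down to averaging two reports for the same image pair. This sketch assumes both passes return the report shape shown in section 3 with identical axis names; it illustrates the averaging only and is not the node's actual `swap_eval` code.

```python
def average_swapped(report_ab: dict, report_ba: dict) -> dict:
    """Average overall and per-axis scores from two judge passes (order swapped)."""
    axes = {}
    for axis, entry in report_ab["axes"].items():
        other = report_ba["axes"][axis]
        axes[axis] = {"score": (entry["score"] + other["score"]) / 2}
    overall = (report_ab["overall_score"] + report_ba["overall_score"]) / 2
    return {"overall_score": overall, "axes": axes}
```

The same helper also covers item 5 for k = 2: feed it two reports for the same prompt on different T2I seeds instead of two image orderings.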

---

## 4. The calibrator / controller

> **Chosen design: the controller is an external CLI agent, not an in‑graph node.**
> The agent reads the Judge's text/JSON analysis, calibrates the prompt, injects it into
> the `CalibratorPromptReceptor` node, and queues ComfyUI via its HTTP API, with one
> `prompt_id` per iteration. See **[AGENT_LOOP.md](AGENT_LOOP.md)** and `agent_bridge.py`.
> The options below describe the *policy* the agent can run.

Prompt‑Builder is a **deterministic, seeded, combinatorial** generator (it is *not* an
LLM). So "calibration" = **searching the space of `(seed, profile, per‑axis overrides)`**
to maximize `overall_score`. Three controller options, easiest → strongest:

1. **Greedy per‑axis hill‑climb (start here).**
   Take the lowest‑scoring axis, apply the matching `fix_suggestion` as a knob
   override (e.g. set `clothing=lingerie`, `cast_women=1`), regenerate, and keep the
   change if `overall_score` improved; otherwise revert. Loop until the score reaches the
   target or no axis improves.
   Implementable today with the Prompt‑Builder **For‑Loop Start/End + Accumulator** nodes.

2. **Black‑box optimizer over the knob vector.**
   Encode the exposed knobs as a parameter vector and drive it with Optuna / CMA‑ES /
   a simple bandit, objective = `overall_score`. Better for >3–4 interacting axes; needs
   a thin Python controller node that holds state across iterations.

3. **LLM‑in‑the‑loop rewriter.**
   Feed `diff_analysis` to a (local) text LLM that proposes the next knob settings (or,
   if you move to free‑text prompts, rewrites the prompt). Most flexible, least
   reproducible; use the same abliterated Qwen3 text head to keep it local and uncensored.

**Loop hygiene:** fix resolution/sampler/steps across iterations; freeze T2I seed while
searching; stop on `overall_score ≥ target` (e.g. 0.85) **or** `max_iters`; log every
`(knobs, score, diff)` triple so the search is auditable and resumable.

---

## 5. Concrete build order

1. **Judge node** (this repo, `nodes/qwen_judge.py`): load local Qwen3‑VL‑4B abliterated,
   take ref+gen, output `overall_score (FLOAT)`, `axis_scores (JSON STRING)`,
   `diff_analysis (STRING)`, `raw (STRING)`. ✅ scaffolded.
2. **Wire the loop** in a workflow: Prompt‑Builder → T2I → Judge → Accumulator, using the
   SxCP For‑Loop nodes; route `overall_score` into the loop's stop condition.
3. **Controller node**: start with greedy per‑axis hill‑climb that reads `diff_analysis`
   and emits knob overrides back into Prompt‑Builder's split control nodes.
4. **Tune the judge**: calibrate the rubric on a handful of known ref/gen pairs; enable
   `swap_eval`; pick temperature; decide if you need to step up to 8B/30B‑A3B.

See [README.md](../README.md) for install/usage of the Judge node.