Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions
The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score) — the target/current/distance an agent needs to calibrate. Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/ camera/render) so the action cluster stays discriminative for explicit content. swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
+11
-12
@@ -18,8 +18,8 @@ reads the analysis, calibrates the prompt generator, and queues the next iterati
|
||||
│ writes calib_<tag>.json + latest.json
|
||||
3. poll /history/{id} (bridge does this) ◄───────────┘
|
||||
4. read report JSON (overall_score,
|
||||
per-axis diffs, fix_suggestions)
|
||||
5. adjust Prompt-Builder knobs / prompt
|
||||
per-axis score + ref/gen values)
|
||||
5. steer prompt toward ref on worst axes
|
||||
└──► go to 1 until overall_score ≥ target
|
||||
```
|
||||
|
||||
@@ -60,23 +60,22 @@ Stdout (captured by the agent) is the report:
|
||||
"run_tag": "iter003",
|
||||
"overall_score": 0.62,
|
||||
"axes": {
|
||||
"pose": {"score": 0.40, "diff": "ref standing, gen seated"},
|
||||
"clothing": {"score": 0.85, "diff": "close; gen lacks lace detail"}
|
||||
"position": {"score": 0.40, "ref": "doggy style", "gen": "missionary"},
|
||||
"clothing_state": {"score": 0.85, "ref": "red lace lingerie", "gen": "plain bra"}
|
||||
},
|
||||
"fix_suggestions": ["set pose=standing", "add 'lace trim' to clothing"],
|
||||
"prompt_used": "1 woman, red lingerie, ...",
|
||||
"prompt_used": "...",
|
||||
"_prompt_id": "…", "_report_path": "…/calib_iter003.json"
|
||||
}
|
||||
```
|
||||
|
||||
## Agent calibration policy (suggested)
|
||||
|
||||
The agent maps the lowest-scoring axes onto Prompt-Builder knobs and applies the
|
||||
`fix_suggestions`, regenerates, and keeps changes that raise `overall_score`
|
||||
(greedy per-axis hill-climb). Keep the **T2I seed fixed** while searching prompt axes so
|
||||
the score reflects the prompt, not sampler noise; vary the seed only once you're near the
|
||||
target. Stop at `overall_score ≥ target` (e.g. 0.85) or a max-iteration budget. Log every
|
||||
`(prompt, knobs, score)` so the search is auditable/resumable.
|
||||
For the lowest-scoring axes, the agent rewrites that axis's prompt wording to match its
|
||||
`ref` value (the target), regenerates, and keeps changes that raise `overall_score`
|
||||
(greedy per-axis hill-climb). The local model supplies no fixes — the agent owns the
|
||||
correction. Keep the **T2I seed fixed** while searching so the score reflects the prompt,
|
||||
not sampler noise; vary the seed only once near target. Stop at `overall_score ≥ target`
|
||||
(e.g. 0.85) or a max-iteration budget. Full policy: **[CALIBRATION_POLICY.md](CALIBRATION_POLICY.md)**.
|
||||
|
||||
## Setup checklist
|
||||
|
||||
|
||||
Reference in New Issue
Block a user