Redesign judge output for calibration: per-axis {score, ref, gen}, drop local fix suggestions

The local VLM now only observes and scores; correction is left to the stronger external agent. Each axis reports the target value (ref), the current value (gen) and the closeness (score): the target/current/distance an agent needs to calibrate.

Expanded to ~20 granular axes (identity/body/wardrobe/action/affect/camera/render) so the action cluster stays discriminative for explicit content.

swap_eval now inverts ref/gen of the swapped pass; diff summary sorts worst-first; default max_new_tokens 1024. Docs aligned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
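A minimal sketch of the per-axis shape, the swap inversion and the worst-first ordering described in this message. This is not the code from the commit: the axis names ("pose", "lighting"), the helper names `unswap` and `worst_first`, and the 0-to-1 score scale where higher means closer are all assumptions made for illustration.

```python
def unswap(axes):
    """Exchange ref and gen on every axis of a swapped pass.

    In a swapped pass the judge saw the images in reverse order, so what it
    called "ref" is really the generated value and vice versa. (Assumed
    behaviour; the commit only says ref/gen of the swapped pass are inverted.)
    """
    return {
        name: {"score": axis["score"], "ref": axis["gen"], "gen": axis["ref"]}
        for name, axis in axes.items()
    }


def worst_first(axes):
    """Return (name, axis) pairs sorted by ascending score, lowest first."""
    return sorted(axes.items(), key=lambda item: item[1]["score"])


# Hypothetical output of a swapped pass, using made-up axis names.
swapped_pass = {
    "lighting": {"score": 0.9, "ref": "soft", "gen": "soft"},
    "pose": {"score": 0.4, "ref": "sitting", "gen": "standing"},
}

aligned = unswap(swapped_pass)
for name, axis in worst_first(aligned):
    print(f"{name}: score={axis['score']} ref={axis['ref']} gen={axis['gen']}")
```

Sorting worst-first puts the axis furthest from its target at the top, which is the one an agent would most likely correct next.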
2026-06-26 22:52:40 +02:00
parent aa3983d94a
commit 959ec70065
6 changed files with 188 additions and 164 deletions
@@ -19,7 +19,7 @@ Stdlib only — no third-party deps, so any agent can shell out to it.
 Loop, from the agent's side:
    1. build a prompt (calibrate from the previous analysis)
    2. run this script -> capture stdout (the analysis JSON)
-    3. read overall_score + per-axis diffs + fix_suggestions
+    3. read overall_score + per-axis {score, ref, gen}
    4. adjust the prompt and go to 1, until overall_score >= target
 """
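The four-step loop in the docstring above could be driven from the agent's side roughly as follows. This is a sketch under stated assumptions, not the project's API: `calibrate`, `fake_judge` and `build_prompt` are hypothetical names, the analysis JSON is assumed to have an `overall_score` number and an `axes` mapping of {score, ref, gen} entries, and a stand-in function replaces the real script invocation so the example runs on its own.

```python
import json


def calibrate(run_judge, build_prompt, target=0.8, max_rounds=5):
    """Repeat build -> judge -> read until overall_score reaches the target.

    run_judge(prompt) must return the analysis JSON as a string (in real use,
    the captured stdout of the script). build_prompt(analysis) receives the
    previous analysis, or None on the first round.
    """
    prompt, analysis = None, None
    for _ in range(max_rounds):
        prompt = build_prompt(analysis)            # step 1: build a prompt
        analysis = json.loads(run_judge(prompt))   # step 2: capture the JSON
        if analysis["overall_score"] >= target:    # step 4: stop at target
            break
    return prompt, analysis


def fake_judge(prompt):
    """Stand-in for the real script; scores high once the prompt says 'standing'."""
    ok = "standing" in prompt
    score = 0.9 if ok else 0.5
    gen = "standing" if ok else "sitting"
    return json.dumps({
        "overall_score": score,
        "axes": {"pose": {"score": score, "ref": "standing", "gen": gen}},
    })


def build_prompt(analysis):
    """Step 3: read the per-axis values and steer toward the worst axis's ref."""
    if analysis is None:
        return "a person"
    _, worst = min(analysis["axes"].items(), key=lambda item: item[1]["score"])
    return "a person, " + worst["ref"]


prompt, analysis = calibrate(fake_judge, build_prompt)
print(prompt, analysis["overall_score"])
```

Passing the judge in as a callable keeps the loop independent of how the script is launched; in practice `run_judge` would shell out to it and return its stdout.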