Re-enable reasoning for accurate verdicts (no-think rubber-stamped 'match')

Disabling thinking made reasoning models mark everything 'match', even when ref and gen clearly differ.

Added an enable_thinking toggle (default ON), threaded through the generation path. The prompt now allows reasoning first and then asks for the result, and verdict_rule explicitly warns against a lazy 'match'.

_parse_json now scans for the JSON object AFTER the reasoning prose (the last balanced object containing 'axes'). The markdown fallback already reads reasoned per-axis output, so it is unchanged.

Default max_new_tokens raised 2048->3072 so verdicts don't get cut off by the longer reasoning output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 10:56:47 +02:00
parent fee136e98c
commit 22fd24b29e
4 changed files with 96 additions and 74 deletions
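
The parsing change described in the commit message (take the last balanced JSON object containing 'axes', ignoring anything in the reasoning prose before it) could look roughly like the sketch below. This is not the repository's `_parse_json`; the name `parse_last_verdict` and the `required_key` parameter are hypothetical, and the sketch relies on `json.JSONDecoder.raw_decode` to find balanced objects instead of counting braces by hand.

```python
import json
from typing import Any, Dict, Optional


def parse_last_verdict(text: str, required_key: str = "axes") -> Optional[Dict[str, Any]]:
    """Return the last JSON object in `text` that has `required_key`, or None.

    Hypothetical helper, assumed for illustration only.
    """
    decoder = json.JSONDecoder()
    found: Optional[Dict[str, Any]] = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one complete JSON value starting at `start`
            # and returns it together with the index just past its end.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # A brace in the reasoning prose that is not valid JSON: skip it.
            pos = start + 1
            continue
        if isinstance(obj, dict) and required_key in obj:
            found = obj        # later matches overwrite earlier ones
            pos = end          # continue after this whole object
        else:
            pos = start + 1    # look inside it for a nested candidate
    return found
```

With reasoning output that drafts one verdict and then revises it, only the final object is returned:

```python
sample = (
    'The ref shows {a red car} but gen is blue. '
    '{"axes": {"color": "match"}, "verdict": "match"} '
    'On reflection the colors differ. '
    '{"axes": {"color": "mismatch"}, "verdict": "mismatch"}'
)
result = parse_last_verdict(sample)
print(result["verdict"])
```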
@@ -68,7 +68,7 @@
      "model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
      "precision": "bf16",
      "profile": "general",
-      "max_new_tokens": 2048,
+      "max_new_tokens": 3072,
      "temperature": 0.0,
      "swap_eval": true,
      "keep_loaded": true,
@@ -12,7 +12,7 @@
      "profile": "general",
      "model_path": "/media/p5/qwen3vl_4b_abliterated_comfy_convert/hf_bf16",
      "precision": "bf16",
-      "max_new_tokens": 2048,
+      "max_new_tokens": 3072,
      "temperature": 0.0,
      "swap_eval": false,
      "keep_loaded": true,