Re-enable reasoning for accurate verdicts (no-think rubber-stamped 'match')

Disabling thinking made reasoning models mark everything 'match' even when ref/gen
clearly differ. This adds an enable_thinking toggle (default ON), threaded through
the generation path. The prompt now lets the model reason first and then asks for
the result, and verdict_rule explicitly warns against a lazy 'match'. _parse_json
now scans for the JSON object AFTER the reasoning prose (the last balanced object
containing 'axes'); the markdown fallback already reads reasoned per-axis output.
Default max_new_tokens is raised 2048->3072 so verdicts don't get cut off.
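
The "last balanced object with 'axes'" scan described above could look roughly like the sketch below. This is an illustration only, not the actual _parse_json: the function name extract_verdict_json and the sample text are hypothetical, and it assumes the verdict is a plain JSON object somewhere after the reasoning prose.

```python
import json


def extract_verdict_json(text, required_key="axes"):
    """Return the last JSON object in `text` containing `required_key`, or None.

    Hypothetical helper: walks every '{' in the text, tries to decode a
    complete JSON value starting there, and remembers the most recent dict
    that carries the required key.
    """
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value at `start` and reports where it ends.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON here (e.g. a brace in the prose); try the next '{'.
            pos = start + 1
            continue
        if isinstance(obj, dict) and required_key in obj:
            found = obj
            pos = end  # skip past the matched object
        else:
            pos = start + 1  # look inside it for a nested candidate
    return found


sample = (
    'The hue {differs} a lot. Draft: {"axes": {"color": "match"}} '
    'On reflection: {"axes": {"color": "mismatch"}, "verdict": "mismatch"}'
)
result = extract_verdict_json(sample)
```

Taking the last matching object rather than the first means a draft verdict emitted mid-reasoning is overridden by the final one.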

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 10:56:47 +02:00
parent fee136e98c
commit 22fd24b29e
4 changed files with 96 additions and 74 deletions
+2 -1
@@ -38,7 +38,8 @@ can act on it.
| `precision` | bf16 / fp8 / nf4 | bf16 | **the quant** — applies to the selected model (VRAM table below) |
| `model_path` | STRING | "" (empty) | **manual override** of the dropdown — local dir, HF repo id, or alias (`8b`/`30b-a3b`/`3.5-9b`/`3.6-27b`/`3.6-35b`). Empty = use `model_select` |
| `axes` | STRING **input** | — | (socket) optional override of the profile's axis set; wire a text node or leave unconnected to use `profile` |
-| `max_new_tokens` | INT | 2048 | raise it if a reasoning model (Qwen3.5/3.6) gets cut off before finishing |
+| `max_new_tokens` | INT | 3072 | reasoning models (Qwen3.5/3.6) need room; raise it if the verdict gets cut off |
+| `enable_thinking` | BOOL | true | let the model reason before judging. **Keep on for accurate verdicts** — off makes reasoning models rubber-stamp `match`. Off is faster |
| `temperature` | FLOAT | 0.0 | 0 = greedy/repeatable |
| `swap_eval` | BOOL | true | run twice with images swapped, average → cuts position bias |
| `keep_loaded` | BOOL | true | cache weights across loop iterations |