Files
8-cut/docs/plans/2026-04-19-scan-history-negatives-design.md
T
Ethanfel e7b791fbfa docs: add scan history & hard negative management design + plan
Covers scan result versioning per model, hard negative management
dialog with training toggle, and ghost folder fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:51:17 +02:00

3.4 KiB

Scan History & Hard Negative Management Design

Date: 2026-04-19

Goal

  1. Keep scan result history per (file, model) so users can track classifier improvement across training iterations
  2. Make hard negatives manageable — viewable, removable, and optionally disabled per training run
  3. Fix latent bug: get_export_folders() doesn't filter by scan_export

1. Scan Result History

Current behavior

save_scan_results() replaces all results for (filename, profile, model) on every scan. No history is preserved.

Change

Keep the last N scan results per (filename, profile, model) with timestamps. The most recent is the "active" result displayed in the panel; older versions are accessible for comparison.

Schema change

Add column to scan_results:

ALTER TABLE scan_results ADD COLUMN scan_timestamp TEXT NOT NULL DEFAULT '';

All rows from the same scan share the same timestamp string (e.g. "20260419_143022").

save_scan_results changes

Instead of DELETE ... WHERE filename=? AND profile=? AND model=?, the new flow:

  1. Insert new rows with current timestamp
  2. Count distinct timestamps for this (filename, profile, model)
  3. If count > N (default 5), delete rows belonging to the oldest timestamps

UI changes

Add a small version dropdown/selector in ScanResultsPanel per model tab — shows timestamps of available scan versions. Selecting a version loads that version's results into the tab. The most recent is selected by default.

The tab label shows the active version's region count, e.g. HUBERT_XLARGE (12) [v3].

Cache interaction

Embedding cache is per (file, model) and doesn't change across scans. Only the classifier output changes. History stores the classified regions (start, end, score), not embeddings.

2. Hard Negative Management

Current behavior

  • Hard negatives stored in hard_negatives table: (filename, profile, start_time, source_path)
  • No model column — applied globally within a profile
  • Removable one-by-one via N toggle in scan panel, but no bulk management
  • Always used in training — no way to disable

Changes

Schema

Add source_model TEXT NOT NULL DEFAULT '' column to hard_negatives. Populated when marking negatives from scan results (we know which model tab is active).

Training toggle

New checkbox in TrainDialog: "Use hard negatives" (default checked). When unchecked, get_training_data() skips the hard_negatives query entirely. Non-destructive — negatives remain in DB.

Management dialog

New HardNegativesDialog accessible from Train dialog via "Manage..." button next to the checkbox. Shows:

  • Table: filename, start time, source model, date added (if we add created_at)
  • Filter by source model (dropdown)
  • Multi-select + Delete button
  • "Clear All" button with confirmation
  • Count summary at top

Training integration

get_training_data() gets a new use_hard_negatives: bool = True parameter. When False, the hard negatives query (lines 365-374 of db.py) is skipped entirely.

3. Ghost Folder Fix

Bug

get_export_folders() queries all output_path rows without filtering scan_export. Folders that only contain scan-exported clips appear in training dropdowns with 0 clips.

Fix

Add include_scan_exports parameter to get_export_folders(). When False (default), only query rows with scan_export = 0. Also filter out folders with 0 clips from get_training_stats() result dict.