docs: add scan history & hard negative management design + plan
Covers scan result versioning per model, hard negative management dialog with training toggle, and ghost folder fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,90 @@
|
||||
# Scan History & Hard Negative Management Design
|
||||
|
||||
Date: 2026-04-19
|
||||
|
||||
## Goal
|
||||
|
||||
1. Keep scan result history per `(file, model)` so users can track classifier improvement across training iterations
|
||||
2. Make hard negatives manageable — viewable, removable, and optionally disabled per training run
|
||||
3. Fix latent bug: `get_export_folders()` doesn't filter by `scan_export`
|
||||
|
||||
## 1. Scan Result History
|
||||
|
||||
### Current behavior
|
||||
|
||||
`save_scan_results()` **replaces** all results for `(filename, profile, model)` on every scan. No history is preserved.
|
||||
|
||||
### Change
|
||||
|
||||
Keep the last N scan results per `(filename, profile, model)` with timestamps. The most recent is the "active" result displayed in the panel; older versions are accessible for comparison.
|
||||
|
||||
### Schema change
|
||||
|
||||
Add column to `scan_results`:
|
||||
|
||||
```sql
|
||||
ALTER TABLE scan_results ADD COLUMN scan_timestamp TEXT NOT NULL DEFAULT '';
|
||||
```
|
||||
|
||||
All rows from the same scan share the same timestamp string (e.g. `"20260419_143022"`).
|
||||
|
||||
### save_scan_results changes
|
||||
|
||||
Instead of `DELETE ... WHERE filename=? AND profile=? AND model=?`, the new flow:
|
||||
|
||||
1. Insert new rows with current timestamp
|
||||
2. Count distinct timestamps for this `(filename, profile, model)`
|
||||
3. If count > N (default 5), delete rows belonging to the oldest timestamps
|
||||
|
||||
### UI changes
|
||||
|
||||
Add a small version dropdown/selector in `ScanResultsPanel` per model tab — shows timestamps of available scan versions. Selecting a version loads that version's results into the tab. The most recent is selected by default.
|
||||
|
||||
The tab label shows the active version's region count, e.g. `HUBERT_XLARGE (12) [v3]`.
|
||||
|
||||
### Cache interaction
|
||||
|
||||
Embedding cache is per `(file, model)` and doesn't change across scans. Only the classifier output changes. History stores the classified regions (start, end, score), not embeddings.
|
||||
|
||||
## 2. Hard Negative Management
|
||||
|
||||
### Current behavior
|
||||
|
||||
- Hard negatives stored in `hard_negatives` table: `(filename, profile, start_time, source_path)`
|
||||
- No model column — applied globally within a profile
|
||||
- Removable one-by-one via N toggle in scan panel, but no bulk management
|
||||
- Always used in training — no way to disable
|
||||
|
||||
### Changes
|
||||
|
||||
#### Schema
|
||||
|
||||
Add `source_model TEXT NOT NULL DEFAULT ''` column to `hard_negatives`. Populated when marking negatives from scan results (we know which model tab is active).
|
||||
|
||||
#### Training toggle
|
||||
|
||||
New checkbox in `TrainDialog`: **"Use hard negatives"** (default checked). When unchecked, `get_training_data()` skips the `hard_negatives` query entirely. Non-destructive — negatives remain in DB.
|
||||
|
||||
#### Management dialog
|
||||
|
||||
New `HardNegativesDialog` accessible from Train dialog via "Manage..." button next to the checkbox. Shows:
|
||||
|
||||
- Table: filename, start time, source model, date added (if we add created_at)
|
||||
- Filter by source model (dropdown)
|
||||
- Multi-select + Delete button
|
||||
- "Clear All" button with confirmation
|
||||
- Count summary at top
|
||||
|
||||
### Training integration
|
||||
|
||||
`get_training_data()` gets a new `use_hard_negatives: bool = True` parameter. When False, the hard negatives query (lines 365-374 of db.py) is skipped entirely.
|
||||
|
||||
## 3. Ghost Folder Fix
|
||||
|
||||
### Bug
|
||||
|
||||
`get_export_folders()` queries all `output_path` rows without filtering `scan_export`. Folders that only contain scan-exported clips appear in training dropdowns with 0 clips.
|
||||
|
||||
### Fix
|
||||
|
||||
Add `include_scan_exports` parameter to `get_export_folders()`. When False (default), only query rows with `scan_export = 0`. Also filter out folders with 0 clips from `get_training_stats()` result dict.
|
||||
Reference in New Issue
Block a user