Files
8-cut/docs/plans/2026-04-19-scan-history-negatives-design.md
Ethanfel 6c1d42adfe feat: vid folder layout, changelog popup, shift-to-resize, DB migration
- Export layout changed from clip_NNN group dirs to vid_NNN per-video folders
- Automatic DB migration rewrites old paths and moves files on startup
- Per-video counter with DB cross-check to prevent overwrites
- Changelog popup on version bump with "don't show again" checkbox
- Scan region resize now requires Shift+drag to prevent accidental edits
- Recalculate vid folder and counter on file load
- Add EAT_LARGE embedding model variant
- Update tests for new flat export path structure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 17:01:37 +02:00

206 lines
9.1 KiB
Markdown

# Scan History & Hard Negative Management — Final Design
Date: 2026-04-19 (implemented on `feat/training-ui`)
## Goal
1. Keep scan result history per `(file, model)` so users can track classifier improvement across training iterations
2. Make hard negatives manageable — viewable, removable, and optionally disabled per training run
3. Fix latent bug: `get_export_folders()` doesn't filter by `scan_export`
---
## 1. Ghost Folder Fix
### Bug
`get_export_folders()` queried all `output_path` rows without filtering `scan_export`. Folders that only contained scan-exported clips appeared in training dropdowns with 0 clips.
### Implementation (`core/db.py`)
**`get_export_folders(profile, include_scan_exports=False)`** — new parameter. When `False` (default), the SQL query adds `AND scan_export = 0` to exclude scan-only folders. The `get_training_stats()` method passes this through and also filters its return dict to remove folders with 0 clips:
```python
return {k: v for k, v in stats.items() if v["clips"] > 0}
```
### Test
`tests/test_db.py::test_export_folders_excludes_scan_exports` — verifies scan-only folders are excluded by default and included when `include_scan_exports=True`.
---
## 2. Scan Result History
### Schema
Added column to `scan_results`:
```sql
scan_timestamp TEXT NOT NULL DEFAULT ''
```
All rows from the same scan share one timestamp string with **microsecond precision** (`%Y%m%d_%H%M%S_%f`, e.g. `"20260419_143022_123456"`). Microsecond precision prevents version collisions on fast successive scans.
Migration adds the column via `ALTER TABLE` for existing databases. Legacy rows keep `scan_timestamp = ''`.
### DB methods (`core/db.py`)
**`save_scan_results(filename, profile, model, regions, max_versions=5)`**
1. Inserts new rows with current microsecond-precision timestamp
2. Counts distinct timestamps for this `(filename, profile, model)`
3. Prunes oldest timestamps beyond `max_versions`
No more DELETE-then-INSERT — all versions coexist in the table.
**`get_scan_versions(filename, profile, model)`**
Returns `[{timestamp, count, max_score}, ...]` ordered newest first. Filters `scan_timestamp != ''` so legacy rows don't appear as named versions.
**`get_scan_results(filename, profile, scan_timestamp=None)`**
- With `scan_timestamp`: returns rows matching that exact version
- Without (default): uses `INNER JOIN` subquery with `MAX(scan_timestamp)` per model to return only the latest version. Legacy rows (empty timestamp) sort before any real timestamp, so they're returned when no versioned scans exist.
### UI (`main.py` — `ScanResultsPanel`)
Each model tab wraps its `QTableWidget` in a container `QWidget` with a `QComboBox` for version selection:
```
container (QWidget)
├── cmb_version (QComboBox) — hidden when ≤ 1 version
└── table (QTableWidget)
```
**Helper methods** unwrap this container:
- `_current_table()` — returns `QTableWidget` from active tab (handles both raw table and container)
- `_tab_table(index)` — same by tab index
**Version combo** is populated by `_populate_version_combos()` after every `load_for_file()` and `add_scan_results()` call. Labels use `datetime.strptime` parsing with try/except fallback for robustness:
```
2026-04-19 14:30 (12 regions, best: 0.95)
```
**Version switching** via `_on_version_changed(model, idx)`:
1. Reads `scan_timestamp` from combo's `userData`
2. Calls `get_scan_results(filename, profile, scan_timestamp=ts)`
3. Repopulates the table in-place
4. **Clears the undo stack** — stale undo entries from a different version would corrupt data
5. Emits `regions_edited` to refresh the timeline
**Tab switch** connects `tab_changed` signal to `_on_scan_regions_edited` (not just `_update_scan_export_count`), so the timeline updates scan regions when switching model tabs.
### Cache interaction
Embedding cache is per `(file, model)` and doesn't change across scans. History stores classified regions (start, end, score), not embeddings.
### Test
`tests/test_db.py::test_scan_result_history` — saves 3 versions, verifies counts, ordering, and latest-by-default behavior.
---
## 3. Hard Negative Management
### Schema
Added column to `hard_negatives`:
```sql
source_model TEXT NOT NULL DEFAULT ''
```
Migration adds the column via `ALTER TABLE` for existing databases.
### DB methods (`core/db.py`)
**`add_hard_negatives(filename, profile, times, source_path="", source_model="")`** — now stores which embedding model produced the scan that led to the negative marking.
**`get_hard_negatives(profile)`** — returns all rows as `[{id, filename, start_time, source_path, source_model}, ...]` for the management dialog.
**`delete_hard_negatives_by_ids(ids)`** — bulk delete by row IDs.
**`get_training_data(..., use_hard_negatives=True)`** — new parameter. When `False`, the hard negatives query is skipped entirely. Non-destructive — negatives remain in DB.
### Source model tracking (`main.py`)
`_on_scan_negatives()` now passes `source_model=self._scan_panel.current_model_name()` when marking negatives from scan results. `current_model_name()` extracts the model name from the active tab text (stripping the count suffix).
### Training toggle (`main.py` — `TrainDialog`)
Checkbox **"Use hard negatives in training"** (default checked) with "Manage..." button in an HBox layout. The toggle:
- Updates live training stats preview via debounced `_update_stats()`
- Passes `use_hard_negatives` through `_open_train_dialog()` to `get_training_data()`
### Management dialog (`main.py` — `HardNegativesDialog`)
Accessible from TrainDialog's "Manage..." button. Features:
| Component | Details |
|-----------|---------|
| **Filter combo** | `(all)` + each distinct `source_model` found in data |
| **Summary label** | `<b>N</b> hard negatives` |
| **Table** | File, Time (`{:.1f}s`), Source Model, hidden ID column |
| **Delete Selected** | Multi-select aware, skips hidden (filtered) rows |
| **Clear All** | **Filter-aware**: if a model filter is active, only deletes negatives for that model with an appropriate confirmation message. If `(all)`, deletes everything. |
| **Close** | Closes dialog, triggers stats refresh in parent TrainDialog |
`blockSignals(True)` guards prevent spurious filter callbacks during `_load()` repopulation.
### Tests
- `test_hard_negatives_source_model` — verifies source_model stored and retrieved
- `test_training_data_skips_hard_negatives` — verifies `use_hard_negatives=False` excludes them
- `test_delete_hard_negatives_by_ids` — verifies bulk deletion by ID
---
## 4. Runtime Fixes (discovered during testing)
### EAT/torchvision ABI mismatch
**Problem:** `torchvision` installed from PyPI (CPU build) was incompatible with `torch` from CUDA wheel index, causing `operator torchvision::nms does not exist`.
**Fix:** Added `torchvision` to the explicit torch install line in both setup scripts:
```bash
pip install torch torchaudio torchvision --index-url "$TORCH_INDEX"
```
Also added `--extra-index-url "$TORCH_INDEX"` to the `pip install -r requirements.txt` line to prevent transitive dependencies (timm, ultralytics) from pulling CPU-only torch packages.
Applied to: `setup_env.sh` (both conda and venv paths), `setup-windows.ps1`.
### EAT / transformers 5.x incompatibility
**Problem:** transformers 5.x broke EAT's remote model code (`'EATModel' object has no attribute 'all_tied_weights_keys'`).
**Fix:** Pinned `transformers>=4.30,<5.0` in `requirements.txt`.
### NumPy non-writable array warning
**Problem:** Cached HuBERT/EAT embeddings loaded from disk are read-only numpy arrays. `torch.from_numpy()` on a non-writable array triggers a deprecation warning.
**Fix:** In `core/audio_scan.py`, changed EAT preprocessing to copy the array:
```python
wav = torch.from_numpy(np.array(chunk)).unsqueeze(0).float()
```
### Timeline not updating on tab switch
**Problem:** Switching model tabs in the scan results panel didn't refresh the timeline's highlighted regions because `tab_changed` was only connected to `_update_scan_export_count`.
**Fix:** Connected `tab_changed` to `_on_scan_regions_edited` instead, which handles both timeline refresh and export count update.
---
## File Summary
| File | Changes |
|------|---------|
| `core/db.py` | Schema migrations, `get_export_folders` filter, versioned `save_scan_results`, `get_scan_versions`, version-aware `get_scan_results`, `add_hard_negatives` with `source_model`, `get_hard_negatives`, `delete_hard_negatives_by_ids`, `get_training_data` with `use_hard_negatives` |
| `main.py` | `HardNegativesDialog` class, `TrainDialog` hard neg toggle + manage button, `ScanResultsPanel` container/combo architecture, version combo population and switching, `current_model_name()`, tab-switch timeline fix |
| `core/audio_scan.py` | `np.array(chunk)` copy for read-only numpy arrays in EAT preprocessing |
| `requirements.txt` | `transformers>=4.30,<5.0` pin |
| `setup_env.sh` | `torchvision` in torch install, `--extra-index-url` on requirements install |
| `setup-windows.ps1` | `torchvision` in torch install, `--extra-index-url` on requirements install, removed skip-if-exists guard |
| `tests/test_db.py` | 5 tests covering all DB-layer changes |