feat: vid folder layout, changelog popup, shift-to-resize, DB migration

- Export layout changed from clip_NNN group dirs to vid_NNN per-video folders - Automatic DB migration rewrites old paths and moves files on startup - Per-video counter with DB cross-check to prevent overwrites - Changelog popup on version bump with "don't show again" checkbox - Scan region resize now requires Shift+drag to prevent accidental edits - Recalculate vid folder and counter on file load - Add EAT_LARGE embedding model variant - Update tests for new flat export path structure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 17:01:37 +02:00
parent d8b3972bdc
commit 6c1d42adfe
7 changed files with 509 additions and 816 deletions
@@ -1,6 +1,6 @@
-# Scan History & Hard Negative Management Design
+# Scan History & Hard Negative Management — Final Design

-Date: 2026-04-19
+Date: 2026-04-19 (implemented on `feat/training-ui`)

 ## Goal

@@ -8,83 +8,198 @@ Date: 2026-04-19
 2. Make hard negatives manageable — viewable, removable, and optionally disabled per training run
 3. Fix latent bug: `get_export_folders()` doesn't filter by `scan_export`

-## 1. Scan Result History
+---

-### Current behavior
-
-`save_scan_results()` **replaces** all results for `(filename, profile, model)` on every scan. No history is preserved.
-
-### Change
-
-Keep the last N scan results per `(filename, profile, model)` with timestamps. The most recent is the "active" result displayed in the panel; older versions are accessible for comparison.
-
-### Schema change
-
-Add column to `scan_results`:
-
-```sql
-ALTER TABLE scan_results ADD COLUMN scan_timestamp TEXT NOT NULL DEFAULT '';
-```
-
-All rows from the same scan share the same timestamp string (e.g. `"20260419_143022"`).
-
-### save_scan_results changes
-
-Instead of `DELETE ... WHERE filename=? AND profile=? AND model=?`, the new flow:
-
-1. Insert new rows with current timestamp
-2. Count distinct timestamps for this `(filename, profile, model)`
-3. If count > N (default 5), delete rows belonging to the oldest timestamps
-
-### UI changes
-
-Add a small version dropdown/selector in `ScanResultsPanel` per model tab — shows timestamps of available scan versions. Selecting a version loads that version's results into the tab. The most recent is selected by default.
-
-The tab label shows the active version's region count, e.g. `HUBERT_XLARGE (12) [v3]`.
-
-### Cache interaction
-
-Embedding cache is per `(file, model)` and doesn't change across scans. Only the classifier output changes. History stores the classified regions (start, end, score), not embeddings.
-
-## 2. Hard Negative Management
-
-### Current behavior
-
- Hard negatives stored in `hard_negatives` table: `(filename, profile, start_time, source_path)`
- No model column — applied globally within a profile
- Removable one-by-one via N toggle in scan panel, but no bulk management
- Always used in training — no way to disable
-
-### Changes
-
-#### Schema
-
-Add `source_model TEXT NOT NULL DEFAULT ''` column to `hard_negatives`. Populated when marking negatives from scan results (we know which model tab is active).
-
-#### Training toggle
-
-New checkbox in `TrainDialog`: **"Use hard negatives"** (default checked). When unchecked, `get_training_data()` skips the `hard_negatives` query entirely. Non-destructive — negatives remain in DB.
-
-#### Management dialog
-
-New `HardNegativesDialog` accessible from Train dialog via "Manage..." button next to the checkbox. Shows:
-
- Table: filename, start time, source model, date added (if we add created_at)
- Filter by source model (dropdown)
- Multi-select + Delete button
- "Clear All" button with confirmation
- Count summary at top
-
-### Training integration
-
-`get_training_data()` gets a new `use_hard_negatives: bool = True` parameter. When False, the hard negatives query (lines 365-374 of db.py) is skipped entirely.
-
-## 3. Ghost Folder Fix
+## 1. Ghost Folder Fix

 ### Bug

-`get_export_folders()` queries all `output_path` rows without filtering `scan_export`. Folders that only contain scan-exported clips appear in training dropdowns with 0 clips.
+`get_export_folders()` queried all `output_path` rows without filtering `scan_export`. Folders that only contained scan-exported clips appeared in training dropdowns with 0 clips.

-### Fix
+### Implementation (`core/db.py`)

-Add `include_scan_exports` parameter to `get_export_folders()`. When False (default), only query rows with `scan_export = 0`. Also filter out folders with 0 clips from `get_training_stats()` result dict.
+**`get_export_folders(profile, include_scan_exports=False)`** — new parameter. When `False` (default), the SQL query adds `AND scan_export = 0` to exclude scan-only folders. The `get_training_stats()` method passes this through and also filters its return dict to remove folders with 0 clips:
+
+```python
+return {k: v for k, v in stats.items() if v["clips"] > 0}
+```
+
+### Test
+
+`tests/test_db.py::test_export_folders_excludes_scan_exports` — verifies scan-only folders are excluded by default and included when `include_scan_exports=True`.
+
+---
+
+## 2. Scan Result History
+
+### Schema
+
+Added column to `scan_results`:
+
+```sql
+scan_timestamp TEXT NOT NULL DEFAULT ''
+```
+
+All rows from the same scan share one timestamp string with **microsecond precision** (`%Y%m%d_%H%M%S_%f`, e.g. `"20260419_143022_123456"`). Microsecond precision prevents version collisions on fast successive scans.
+
+Migration adds the column via `ALTER TABLE` for existing databases. Legacy rows keep `scan_timestamp = ''`.
+
+### DB methods (`core/db.py`)
+
+**`save_scan_results(filename, profile, model, regions, max_versions=5)`**
+1. Inserts new rows with current microsecond-precision timestamp
+2. Counts distinct timestamps for this `(filename, profile, model)`
+3. Prunes oldest timestamps beyond `max_versions`
+
+No more DELETE-then-INSERT — all versions coexist in the table.
+
+**`get_scan_versions(filename, profile, model)`**
+Returns `[{timestamp, count, max_score}, ...]` ordered newest first. Filters `scan_timestamp != ''` so legacy rows don't appear as named versions.
+
+**`get_scan_results(filename, profile, scan_timestamp=None)`**
+- With `scan_timestamp`: returns rows matching that exact version
+- Without (default): uses `INNER JOIN` subquery with `MAX(scan_timestamp)` per model to return only the latest version. Legacy rows (empty timestamp) sort before any real timestamp, so they're returned when no versioned scans exist.
+
+### UI (`main.py` — `ScanResultsPanel`)
+
+Each model tab wraps its `QTableWidget` in a container `QWidget` with a `QComboBox` for version selection:
+
+```
+container (QWidget)
+├── cmb_version (QComboBox) — hidden when ≤ 1 version
+└── table (QTableWidget)
+```
+
+**Helper methods** unwrap this container:
+- `_current_table()` — returns `QTableWidget` from active tab (handles both raw table and container)
+- `_tab_table(index)` — same by tab index
+
+**Version combo** is populated by `_populate_version_combos()` after every `load_for_file()` and `add_scan_results()` call. Labels use `datetime.strptime` parsing with try/except fallback for robustness:
+
+```
+2026-04-19 14:30 (12 regions, best: 0.95)
+```
+
+**Version switching** via `_on_version_changed(model, idx)`:
+1. Reads `scan_timestamp` from combo's `userData`
+2. Calls `get_scan_results(filename, profile, scan_timestamp=ts)`
+3. Repopulates the table in-place
+4. **Clears the undo stack** — stale undo entries from a different version would corrupt data
+5. Emits `regions_edited` to refresh the timeline
+
+**Tab switch** connects `tab_changed` signal to `_on_scan_regions_edited` (not just `_update_scan_export_count`), so the timeline updates scan regions when switching model tabs.
+
+### Cache interaction
+
+Embedding cache is per `(file, model)` and doesn't change across scans. History stores classified regions (start, end, score), not embeddings.
+
+### Test
+
+`tests/test_db.py::test_scan_result_history` — saves 3 versions, verifies counts, ordering, and latest-by-default behavior.
+
+---
+
+## 3. Hard Negative Management
+
+### Schema
+
+Added column to `hard_negatives`:
+
+```sql
+source_model TEXT NOT NULL DEFAULT ''
+```
+
+Migration adds the column via `ALTER TABLE` for existing databases.
+
+### DB methods (`core/db.py`)
+
+**`add_hard_negatives(filename, profile, times, source_path="", source_model="")`** — now stores which embedding model produced the scan that led to the negative marking.
+
+**`get_hard_negatives(profile)`** — returns all rows as `[{id, filename, start_time, source_path, source_model}, ...]` for the management dialog.
+
+**`delete_hard_negatives_by_ids(ids)`** — bulk delete by row IDs.
+
+**`get_training_data(..., use_hard_negatives=True)`** — new parameter. When `False`, the hard negatives query is skipped entirely. Non-destructive — negatives remain in DB.
+
+### Source model tracking (`main.py`)
+
+`_on_scan_negatives()` now passes `source_model=self._scan_panel.current_model_name()` when marking negatives from scan results. `current_model_name()` extracts the model name from the active tab text (stripping the count suffix).
+
+### Training toggle (`main.py` — `TrainDialog`)
+
+Checkbox **"Use hard negatives in training"** (default checked) with "Manage..." button in an HBox layout. The toggle:
+- Updates live training stats preview via debounced `_update_stats()`
+- Passes `use_hard_negatives` through `_open_train_dialog()` to `get_training_data()`
+
+### Management dialog (`main.py` — `HardNegativesDialog`)
+
+Accessible from TrainDialog's "Manage..." button. Features:
+
+| Component | Details |
+|-----------|---------|
+| **Filter combo** | `(all)` + each distinct `source_model` found in data |
+| **Summary label** | `<b>N</b> hard negatives` |
+| **Table** | File, Time (`{:.1f}s`), Source Model, hidden ID column |
+| **Delete Selected** | Multi-select aware, skips hidden (filtered) rows |
+| **Clear All** | **Filter-aware**: if a model filter is active, only deletes negatives for that model with an appropriate confirmation message. If `(all)`, deletes everything. |
+| **Close** | Closes dialog, triggers stats refresh in parent TrainDialog |
+
+`blockSignals(True)` guards prevent spurious filter callbacks during `_load()` repopulation.
+
+### Tests
+
+- `test_hard_negatives_source_model` — verifies source_model stored and retrieved
+- `test_training_data_skips_hard_negatives` — verifies `use_hard_negatives=False` excludes them
+- `test_delete_hard_negatives_by_ids` — verifies bulk deletion by ID
+
+---
+
+## 4. Runtime Fixes (discovered during testing)
+
+### EAT/torchvision ABI mismatch
+
+**Problem:** `torchvision` installed from PyPI (CPU build) was incompatible with `torch` from CUDA wheel index, causing `operator torchvision::nms does not exist`.
+
+**Fix:** Added `torchvision` to the explicit torch install line in both setup scripts:
+```bash
+pip install torch torchaudio torchvision --index-url "$TORCH_INDEX"
+```
+
+Also added `--extra-index-url "$TORCH_INDEX"` to the `pip install -r requirements.txt` line to prevent transitive dependencies (timm, ultralytics) from pulling CPU-only torch packages.
+
+Applied to: `setup_env.sh` (both conda and venv paths), `setup-windows.ps1`.
+
+### EAT / transformers 5.x incompatibility
+
+**Problem:** transformers 5.x broke EAT's remote model code (`'EATModel' object has no attribute 'all_tied_weights_keys'`).
+
+**Fix:** Pinned `transformers>=4.30,<5.0` in `requirements.txt`.
+
+### NumPy non-writable array warning
+
+**Problem:** Cached HuBERT/EAT embeddings loaded from disk are read-only numpy arrays. `torch.from_numpy()` on a non-writable array triggers a deprecation warning.
+
+**Fix:** In `core/audio_scan.py`, changed EAT preprocessing to copy the array:
+```python
+wav = torch.from_numpy(np.array(chunk)).unsqueeze(0).float()
+```
+
+### Timeline not updating on tab switch
+
+**Problem:** Switching model tabs in the scan results panel didn't refresh the timeline's highlighted regions because `tab_changed` was only connected to `_update_scan_export_count`.
+
+**Fix:** Connected `tab_changed` to `_on_scan_regions_edited` instead, which handles both timeline refresh and export count update.
+
+---
+
+## File Summary
+
+| File | Changes |
+|------|---------|
+| `core/db.py` | Schema migrations, `get_export_folders` filter, versioned `save_scan_results`, `get_scan_versions`, version-aware `get_scan_results`, `add_hard_negatives` with `source_model`, `get_hard_negatives`, `delete_hard_negatives_by_ids`, `get_training_data` with `use_hard_negatives` |
+| `main.py` | `HardNegativesDialog` class, `TrainDialog` hard neg toggle + manage button, `ScanResultsPanel` container/combo architecture, version combo population and switching, `current_model_name()`, tab-switch timeline fix |
+| `core/audio_scan.py` | `np.array(chunk)` copy for read-only numpy arrays in EAT preprocessing |
+| `requirements.txt` | `transformers>=4.30,<5.0` pin |
+| `setup_env.sh` | `torchvision` in torch install, `--extra-index-url` on requirements install |
+| `setup-windows.ps1` | `torchvision` in torch install, `--extra-index-url` on requirements install, removed skip-if-exists guard |
+| `tests/test_db.py` | 5 tests covering all DB-layer changes |