Scan History & Hard Negative Management — Final Design
Date: 2026-04-19 (implemented on feat/training-ui)
Goal
- Keep scan result history per (file, model) so users can track classifier improvement across training iterations
- Make hard negatives manageable: viewable, removable, and optionally disabled per training run
- Fix latent bug: get_export_folders() doesn't filter by scan_export
1. Ghost Folder Fix
Bug
get_export_folders() queried all output_path rows without filtering scan_export. Folders that only contained scan-exported clips appeared in training dropdowns with 0 clips.
Implementation (core/db.py)
get_export_folders(profile, include_scan_exports=False) — new parameter. When False (default), the SQL query adds AND scan_export = 0 to exclude scan-only folders. The get_training_stats() method passes this through and also filters its return dict to remove folders with 0 clips:
return {k: v for k, v in stats.items() if v["clips"] > 0}
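The filter can be sketched as follows. This is a minimal illustration, not the code in core/db.py: the exports table name, its column layout, the conn argument, and deriving a folder with os.path.dirname are all assumptions made for the example.

```python
import os
import sqlite3


def get_export_folders(conn, profile, include_scan_exports=False):
    """Return sorted distinct folders that hold exported clips for a profile."""
    sql = "SELECT output_path FROM exports WHERE profile = ?"
    if not include_scan_exports:
        # Default: hide rows that were written by scan export.
        sql += " AND scan_export = 0"
    rows = conn.execute(sql, (profile,)).fetchall()
    return sorted({os.path.dirname(path) for (path,) in rows})


conn = sqlite3.connect(":memory:")
# Hypothetical table layout, for illustration only.
conn.execute(
    "CREATE TABLE exports (profile TEXT, output_path TEXT, "
    "scan_export INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO exports VALUES (?, ?, ?)",
    [
        ("default", "out/manual/a.wav", 0),
        ("default", "out/scan_only/b.wav", 1),
    ],
)

default_folders = get_export_folders(conn, "default")
all_folders = get_export_folders(conn, "default", include_scan_exports=True)

# Same dict filter as above: drop folders that ended up with 0 clips.
stats = {"out/manual": {"clips": 1}, "out/empty": {"clips": 0}}
non_empty = {k: v for k, v in stats.items() if v["clips"] > 0}
```

With this data, the scan-only folder is listed only when include_scan_exports=True is passed.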
Test
tests/test_db.py::test_export_folders_excludes_scan_exports — verifies scan-only folders are excluded by default and included when include_scan_exports=True.
2. Scan Result History
Schema
Added column to scan_results:
scan_timestamp TEXT NOT NULL DEFAULT ''
All rows from the same scan share one timestamp string with microsecond precision (%Y%m%d_%H%M%S_%f, e.g. "20260419_143022_123456"). Microsecond precision prevents version collisions on fast successive scans.
Migration adds the column via ALTER TABLE for existing databases. Legacy rows keep scan_timestamp = ''.
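One way such a migration and timestamp could look is sketched below. The helper names ensure_column and new_scan_timestamp are hypothetical, and the pre-migration table layout is invented for the example; only the column definition and the timestamp format come from this document.

```python
import sqlite3
from datetime import datetime


def ensure_column(conn, table, column, ddl):
    """Add a column via ALTER TABLE only if the table does not have it yet."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")


def new_scan_timestamp(now=None):
    """Format one timestamp to be shared by every row of a single scan."""
    return (now or datetime.now()).strftime("%Y%m%d_%H%M%S_%f")


conn = sqlite3.connect(":memory:")
# Hypothetical pre-migration table holding one legacy row.
conn.execute("CREATE TABLE scan_results (filename TEXT, score REAL)")
conn.execute("INSERT INTO scan_results VALUES ('a.wav', 0.9)")

ddl = "TEXT NOT NULL DEFAULT ''"
ensure_column(conn, "scan_results", "scan_timestamp", ddl)
ensure_column(conn, "scan_results", "scan_timestamp", ddl)  # second call is a no-op

legacy_ts = conn.execute("SELECT scan_timestamp FROM scan_results").fetchone()[0]
example_ts = new_scan_timestamp(datetime(2026, 4, 19, 14, 30, 22, 123456))
```

Checking PRAGMA table_info first keeps the migration safe to run on every startup, and the DEFAULT '' clause is what leaves legacy rows with an empty timestamp.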
DB methods (core/db.py)
save_scan_results(filename, profile, model, regions, max_versions=5)
- Inserts new rows with current microsecond-precision timestamp
- Counts distinct timestamps for this (filename, profile, model)
- Prunes oldest timestamps beyond max_versions
No more DELETE-then-INSERT — all versions coexist in the table.
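The insert-then-prune flow might be sketched like this. It is an illustration under assumptions: the conn and ts parameters, and the start_time, end_time, and score column names are not taken from core/db.py.

```python
import sqlite3
from datetime import datetime


def save_scan_results(conn, filename, profile, model, regions,
                      max_versions=5, ts=None):
    """Insert one scan as a new version, then prune the oldest versions."""
    ts = ts or datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    conn.executemany(
        "INSERT INTO scan_results "
        "(filename, profile, model, start_time, end_time, score, scan_timestamp) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(filename, profile, model, s, e, sc, ts) for (s, e, sc) in regions],
    )
    key = (filename, profile, model)
    # Fixed-width timestamps sort lexicographically in chronological order.
    versions = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT scan_timestamp FROM scan_results "
            "WHERE filename = ? AND profile = ? AND model = ? "
            "ORDER BY scan_timestamp DESC",
            key,
        )
    ]
    for old_ts in versions[max_versions:]:
        conn.execute(
            "DELETE FROM scan_results WHERE filename = ? AND profile = ? "
            "AND model = ? AND scan_timestamp = ?",
            key + (old_ts,),
        )
    conn.commit()
    return ts


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (filename TEXT, profile TEXT, model TEXT, "
    "start_time REAL, end_time REAL, score REAL, "
    "scan_timestamp TEXT NOT NULL DEFAULT '')"
)
# Seven scans with explicit timestamps; only the newest five should survive.
for i in range(7):
    save_scan_results(conn, "a.wav", "default", "EAT", [(0.0, 1.0, 0.5)],
                      ts=f"20260419_14302{i}_000000")

remaining = [r[0] for r in conn.execute(
    "SELECT DISTINCT scan_timestamp FROM scan_results ORDER BY scan_timestamp")]
```

Pruning by timestamp string works here only because the format is fixed-width, so string order and chronological order agree.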
get_scan_versions(filename, profile, model)
Returns [{timestamp, count, max_score}, ...] ordered newest first. Filters scan_timestamp != '' so legacy rows don't appear as named versions.
get_scan_results(filename, profile, scan_timestamp=None)
- With scan_timestamp: returns rows matching that exact version
- Without (default): uses an INNER JOIN subquery with MAX(scan_timestamp) per model to return only the latest version. Legacy rows (empty timestamp) sort before any real timestamp, so they're returned when no versioned scans exist.
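The latest-per-model lookup could be written roughly as below. The exact SQL in core/db.py is not shown in this document, so the query text, the conn parameter, and the column names are assumptions.

```python
import sqlite3

# Join each row against the newest timestamp recorded for its model.
LATEST_SQL = """
SELECT r.model, r.start_time, r.end_time, r.score, r.scan_timestamp
FROM scan_results AS r
INNER JOIN (
    SELECT model, MAX(scan_timestamp) AS latest
    FROM scan_results
    WHERE filename = ? AND profile = ?
    GROUP BY model
) AS m ON r.model = m.model AND r.scan_timestamp = m.latest
WHERE r.filename = ? AND r.profile = ?
ORDER BY r.model, r.start_time
"""


def get_scan_results(conn, filename, profile, scan_timestamp=None):
    """Return one exact version, or the latest version per model by default."""
    if scan_timestamp is not None:
        return conn.execute(
            "SELECT model, start_time, end_time, score, scan_timestamp "
            "FROM scan_results WHERE filename = ? AND profile = ? "
            "AND scan_timestamp = ? ORDER BY model, start_time",
            (filename, profile, scan_timestamp),
        ).fetchall()
    return conn.execute(
        LATEST_SQL, (filename, profile, filename, profile)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (filename TEXT, profile TEXT, model TEXT, "
    "start_time REAL, end_time REAL, score REAL, "
    "scan_timestamp TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO scan_results VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a.wav", "default", "EAT", 0.0, 1.0, 0.4, ""),  # legacy row
        ("a.wav", "default", "EAT", 0.0, 1.0, 0.6, "20260419_143022_123456"),
        ("a.wav", "default", "EAT", 0.0, 1.0, 0.9, "20260419_150000_000000"),
        ("a.wav", "default", "HUBERT", 2.0, 3.0, 0.7, ""),  # legacy only
    ],
)

latest = get_scan_results(conn, "a.wav", "default")
pinned = get_scan_results(conn, "a.wav", "default", "20260419_143022_123456")
```

In this data the EAT tab resolves to its newest versioned scan, while HUBERT falls back to its legacy row because the empty string is the only (and therefore maximum) timestamp it has.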
UI (main.py — ScanResultsPanel)
Each model tab wraps its QTableWidget in a container QWidget with a QComboBox for version selection:
container (QWidget)
├── cmb_version (QComboBox) — hidden when ≤ 1 version
└── table (QTableWidget)
Helper methods unwrap this container:
- _current_table(): returns the QTableWidget from the active tab (handles both raw table and container)
- _tab_table(index): same, by tab index
Version combo is populated by _populate_version_combos() after every load_for_file() and add_scan_results() call. Labels use datetime.strptime parsing with try/except fallback for robustness:
2026-04-19 14:30 (12 regions, best: 0.95)
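A label of this shape can be produced as sketched below. The function name format_version_label is hypothetical, and falling back to the raw timestamp string is an assumed behavior, since the document does not say what the fallback shows.

```python
from datetime import datetime


def format_version_label(timestamp, count, max_score):
    """Build a combo label like '2026-04-19 14:30 (12 regions, best: 0.95)'."""
    try:
        parsed = datetime.strptime(timestamp, "%Y%m%d_%H%M%S_%f")
        when = parsed.strftime("%Y-%m-%d %H:%M")
    except ValueError:
        # Assumed fallback: show the raw string instead of failing.
        when = timestamp
    return f"{when} ({count} regions, best: {max_score:.2f})"


label = format_version_label("20260419_143022_123456", 12, 0.95)
fallback = format_version_label("not-a-timestamp", 3, 0.5)
```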
Version switching via _on_version_changed(model, idx):
- Reads scan_timestamp from the combo's userData
- Calls get_scan_results(filename, profile, scan_timestamp=ts)
- Repopulates the table in-place
- Clears the undo stack, since stale undo entries from a different version would corrupt data
- Emits regions_edited to refresh the timeline
On tab switch, the tab_changed signal is connected to _on_scan_regions_edited (not just _update_scan_export_count), so the timeline refreshes its scan regions whenever the user switches model tabs.
Cache interaction
Embedding cache is per (file, model) and doesn't change across scans. History stores classified regions (start, end, score), not embeddings.
Test
tests/test_db.py::test_scan_result_history — saves 3 versions, verifies counts, ordering, and latest-by-default behavior.
3. Hard Negative Management
Schema
Added column to hard_negatives:
source_model TEXT NOT NULL DEFAULT ''
Migration adds the column via ALTER TABLE for existing databases.
DB methods (core/db.py)
add_hard_negatives(filename, profile, times, source_path="", source_model="") — now stores which embedding model produced the scan that led to the negative marking.
get_hard_negatives(profile) — returns all rows as [{id, filename, start_time, source_path, source_model}, ...] for the management dialog.
delete_hard_negatives_by_ids(ids) — bulk delete by row IDs.
get_training_data(..., use_hard_negatives=True) — new parameter. When False, the hard negatives query is skipped entirely. Non-destructive — negatives remain in DB.
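The behavior of these methods can be sketched against an in-memory table. This is an illustration only: the conn parameter, the table layout beyond the listed fields, and load_negative_times (a stand-in for the hard-negative part of get_training_data) are assumptions.

```python
import sqlite3


def add_hard_negatives(conn, filename, profile, times,
                       source_path="", source_model=""):
    """Store one row per negative time, tagged with the model that found it."""
    conn.executemany(
        "INSERT INTO hard_negatives "
        "(filename, profile, start_time, source_path, source_model) "
        "VALUES (?, ?, ?, ?, ?)",
        [(filename, profile, t, source_path, source_model) for t in times],
    )


def delete_hard_negatives_by_ids(conn, ids):
    """Bulk delete by row ID; returns the number of rows removed."""
    ids = list(ids)
    if not ids:
        return 0  # nothing to delete
    placeholders = ",".join("?" * len(ids))
    cur = conn.execute(
        f"DELETE FROM hard_negatives WHERE id IN ({placeholders})", ids
    )
    return cur.rowcount


def load_negative_times(conn, profile, use_hard_negatives=True):
    """Hypothetical stand-in for the hard-negative query in get_training_data()."""
    if not use_hard_negatives:
        return []  # query skipped entirely; rows stay in the table
    return [r[0] for r in conn.execute(
        "SELECT start_time FROM hard_negatives WHERE profile = ? "
        "ORDER BY start_time", (profile,))]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hard_negatives (id INTEGER PRIMARY KEY, filename TEXT, "
    "profile TEXT, start_time REAL, source_path TEXT, "
    "source_model TEXT NOT NULL DEFAULT '')"
)
add_hard_negatives(conn, "a.wav", "default", [1.0, 2.5, 4.0], source_model="EAT")

skipped = load_negative_times(conn, "default", use_hard_negatives=False)
still_stored = conn.execute("SELECT COUNT(*) FROM hard_negatives").fetchone()[0]
deleted = delete_hard_negatives_by_ids(conn, [1, 3])
remaining = load_negative_times(conn, "default")
```

Disabling negatives for a run returns nothing but leaves all three rows stored, whereas deleting by ID removes them permanently.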
Source model tracking (main.py)
_on_scan_negatives() now passes source_model=self._scan_panel.current_model_name() when marking negatives from scan results. current_model_name() extracts the model name from the active tab text (stripping the count suffix).
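Stripping the count suffix might look like the sketch below. The tab text format is not given in this document, so a trailing count in parentheses such as "EAT_LARGE (12)" is assumed, and model_name_from_tab_text is a hypothetical helper name.

```python
import re

# Assumed tab text shape: "<model name> (<count>)".
_COUNT_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")


def model_name_from_tab_text(tab_text):
    """Return the model name with any trailing '(N)' count removed."""
    return _COUNT_SUFFIX.sub("", tab_text).strip()


with_count = model_name_from_tab_text("EAT_LARGE (12)")
without_count = model_name_from_tab_text("HUBERT")
```

Anchoring the pattern at the end of the string keeps model names that themselves contain parentheses intact.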
Training toggle (main.py — TrainDialog)
Checkbox "Use hard negatives in training" (default checked) with "Manage..." button in an HBox layout. The toggle:
- Updates live training stats preview via debounced _update_stats()
- Passes use_hard_negatives through _open_train_dialog() to get_training_data()
Management dialog (main.py — HardNegativesDialog)
Accessible from TrainDialog's "Manage..." button. Features:
| Component | Details |
|---|---|
| Filter combo | (all) + each distinct source_model found in data |
| Summary label | <b>N</b> hard negatives |
| Table | File, Time ({:.1f}s), Source Model, hidden ID column |
| Delete Selected | Multi-select aware, skips hidden (filtered) rows |
| Clear All | Filter-aware: if a model filter is active, only deletes negatives for that model with an appropriate confirmation message. If (all), deletes everything. |
| Close | Closes dialog, triggers stats refresh in parent TrainDialog |
blockSignals(True) guards prevent spurious filter callbacks during _load() repopulation.
Tests
- test_hard_negatives_source_model: verifies source_model is stored and retrieved
- test_training_data_skips_hard_negatives: verifies use_hard_negatives=False excludes them
- test_delete_hard_negatives_by_ids: verifies bulk deletion by ID
4. Runtime Fixes (discovered during testing)
EAT/torchvision ABI mismatch
Problem: torchvision installed from PyPI (CPU build) was incompatible with torch from CUDA wheel index, causing operator torchvision::nms does not exist.
Fix: Added torchvision to the explicit torch install line in both setup scripts:
pip install torch torchaudio torchvision --index-url "$TORCH_INDEX"
Also added --extra-index-url "$TORCH_INDEX" to the pip install -r requirements.txt line to prevent transitive dependencies (timm, ultralytics) from pulling CPU-only torch packages.
Applied to: setup_env.sh (both conda and venv paths), setup-windows.ps1.
EAT / transformers 5.x incompatibility
Problem: transformers 5.x broke EAT's remote model code ('EATModel' object has no attribute 'all_tied_weights_keys').
Fix: Pinned transformers>=4.30,<5.0 in requirements.txt.
NumPy non-writable array warning
Problem: Cached HuBERT/EAT embeddings loaded from disk are read-only numpy arrays. Calling torch.from_numpy() on a non-writable array triggers a UserWarning, because PyTorch does not support non-writable tensors.
Fix: In core/audio_scan.py, changed EAT preprocessing to copy the array:
wav = torch.from_numpy(np.array(chunk)).unsqueeze(0).float()
Timeline not updating on tab switch
Problem: Switching model tabs in the scan results panel didn't refresh the timeline's highlighted regions because tab_changed was only connected to _update_scan_export_count.
Fix: Connected tab_changed to _on_scan_regions_edited instead, which handles both timeline refresh and export count update.
File Summary
| File | Changes |
|---|---|
| core/db.py | Schema migrations, get_export_folders filter, versioned save_scan_results, get_scan_versions, version-aware get_scan_results, add_hard_negatives with source_model, get_hard_negatives, delete_hard_negatives_by_ids, get_training_data with use_hard_negatives |
| main.py | HardNegativesDialog class, TrainDialog hard neg toggle + manage button, ScanResultsPanel container/combo architecture, version combo population and switching, current_model_name(), tab-switch timeline fix |
| core/audio_scan.py | np.array(chunk) copy for read-only numpy arrays in EAT preprocessing |
| requirements.txt | transformers>=4.30,<5.0 pin |
| setup_env.sh | torchvision in torch install, --extra-index-url on requirements install |
| setup-windows.ps1 | torchvision in torch install, --extra-index-url on requirements install, removed skip-if-exists guard |
| tests/test_db.py | 5 tests covering all DB-layer changes |