Scan History & Hard Negative Management — Final Design

Date: 2026-04-19 (implemented on feat/training-ui)

Goal

- Keep scan result history per (file, model) so users can track classifier improvement across training iterations
- Make hard negatives manageable: viewable, removable, and optionally disabled per training run
- Fix latent bug: get_export_folders() doesn't filter by scan_export

1. Ghost Folder Fix

Bug

get_export_folders() queried all output_path rows without filtering scan_export. Folders that only contained scan-exported clips appeared in training dropdowns with 0 clips.

Implementation (`core/db.py`)

get_export_folders(profile, include_scan_exports=False) — new parameter. When False (default), the SQL query adds AND scan_export = 0 to exclude scan-only folders. The get_training_stats() method passes this through and also filters its return dict to remove folders with 0 clips:

return {k: v for k, v in stats.items() if v["clips"] > 0}
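The filter can be sketched with sqlite3 as below. This is only an illustration under stated assumptions: the table name `clips`, the standalone-function form taking a `conn`, and deriving the folder from the directory part of `output_path` are all hypothetical; only `profile`, `output_path`, `scan_export`, and the `include_scan_exports` parameter come from the design.

```python
import os
import sqlite3


def get_export_folders(conn, profile, include_scan_exports=False):
    """Return distinct export folders for a profile (hypothetical schema)."""
    sql = "SELECT DISTINCT output_path FROM clips WHERE profile = ?"
    if not include_scan_exports:
        # Default: hide folders that only hold scan-exported clips.
        sql += " AND scan_export = 0"
    rows = conn.execute(sql, (profile,)).fetchall()
    return sorted({os.path.dirname(path) for (path,) in rows})


# Small in-memory demo with one manual export and one scan export.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clips (profile TEXT, output_path TEXT,"
    " scan_export INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO clips VALUES (?, ?, ?)",
    [
        ("default", "out/manual/a.mp4", 0),
        ("default", "out/scans/b.mp4", 1),
    ],
)
default_folders = get_export_folders(conn, "default")
all_folders = get_export_folders(conn, "default", include_scan_exports=True)
```

With the default argument only `out/manual` is listed; passing `include_scan_exports=True` also lists `out/scans`.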

Test

tests/test_db.py::test_export_folders_excludes_scan_exports — verifies scan-only folders are excluded by default and included when include_scan_exports=True.

2. Scan Result History

Schema

Added column to scan_results:

scan_timestamp TEXT NOT NULL DEFAULT ''

All rows from the same scan share one timestamp string with microsecond precision (%Y%m%d_%H%M%S_%f, e.g. "20260419_143022_123456"). Microsecond precision prevents version collisions on fast successive scans.

Migration adds the column via ALTER TABLE for existing databases. Legacy rows keep scan_timestamp = ''.
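A minimal sketch of such a migration and of the timestamp format, assuming a plain sqlite3 connection. The helper names `ensure_scan_timestamp_column` and `new_scan_timestamp`, the `PRAGMA table_info` existence check, and the legacy table layout in the demo are assumptions for illustration, not the project's actual code.

```python
import sqlite3
from datetime import datetime

SCAN_TS_FORMAT = "%Y%m%d_%H%M%S_%f"  # format string quoted in the design


def ensure_scan_timestamp_column(conn):
    """Add scan_timestamp to scan_results if an older database lacks it."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scan_results)")}
    if "scan_timestamp" not in columns:
        conn.execute(
            "ALTER TABLE scan_results"
            " ADD COLUMN scan_timestamp TEXT NOT NULL DEFAULT ''"
        )


def new_scan_timestamp(now=None):
    """One string per scan; every row of that scan shares it."""
    return (now or datetime.now()).strftime(SCAN_TS_FORMAT)


# Demo: a hypothetical pre-migration table with one legacy row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_results (filename TEXT, score REAL)")
conn.execute("INSERT INTO scan_results VALUES ('a.wav', 0.9)")
ensure_scan_timestamp_column(conn)
ensure_scan_timestamp_column(conn)  # running it again changes nothing
legacy_ts = conn.execute("SELECT scan_timestamp FROM scan_results").fetchone()[0]
example_ts = new_scan_timestamp(datetime(2026, 4, 19, 14, 30, 22, 123456))
```

The legacy row ends up with an empty `scan_timestamp`, and the fixed datetime formats to `20260419_143022_123456`.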

DB methods (`core/db.py`)

save_scan_results(filename, profile, model, regions, max_versions=5)

- Inserts new rows with the current microsecond-precision timestamp
- Counts distinct timestamps for this (filename, profile, model)
- Prunes the oldest timestamps beyond max_versions

No more DELETE-then-INSERT — all versions coexist in the table.
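The insert, count, and prune steps might look roughly like this. The column names `start_time` and `end_time`, the `conn` and `now` parameters, and the choice to count legacy empty-timestamp rows as the oldest version are assumptions; the design does not say how legacy rows are treated during pruning.

```python
import sqlite3
from datetime import datetime


def save_scan_results(conn, filename, profile, model, regions,
                      max_versions=5, now=None):
    """Append one scan version, then prune versions beyond max_versions."""
    ts = (now or datetime.now()).strftime("%Y%m%d_%H%M%S_%f")
    conn.executemany(
        "INSERT INTO scan_results (filename, profile, model,"
        " start_time, end_time, score, scan_timestamp)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(filename, profile, model, s, e, score, ts) for (s, e, score) in regions],
    )
    key = (filename, profile, model)
    # Newest first; the fixed-width format makes string order match time order.
    versions = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT scan_timestamp FROM scan_results"
            " WHERE filename = ? AND profile = ? AND model = ?"
            " ORDER BY scan_timestamp DESC",
            key,
        )
    ]
    for stale in versions[max_versions:]:
        conn.execute(
            "DELETE FROM scan_results WHERE filename = ? AND profile = ?"
            " AND model = ? AND scan_timestamp = ?",
            key + (stale,),
        )
    return ts


# Demo: four scans with max_versions=3 leave the three newest.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (filename TEXT, profile TEXT, model TEXT,"
    " start_time REAL, end_time REAL, score REAL,"
    " scan_timestamp TEXT NOT NULL DEFAULT '')"
)
for minute in range(4):
    save_scan_results(conn, "a.wav", "default", "EAT", [(0.0, 1.0, 0.5)],
                      max_versions=3, now=datetime(2026, 4, 19, 14, minute))
kept = [r[0] for r in conn.execute(
    "SELECT DISTINCT scan_timestamp FROM scan_results ORDER BY scan_timestamp")]
```

Because pruning is scoped to one (filename, profile, model) key, scans of other files or models are never touched.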

get_scan_versions(filename, profile, model)

Returns [{timestamp, count, max_score}, ...] ordered newest first. Filters scan_timestamp != '' so legacy rows don't appear as named versions.
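One way to produce that list is a single grouped query. This is a sketch under the assumption that `scan_results` has `score` and `scan_timestamp` columns as described; the `conn` parameter and the demo rows are illustrative.

```python
import sqlite3


def get_scan_versions(conn, filename, profile, model):
    """List named scan versions, newest first, with row count and best score."""
    rows = conn.execute(
        "SELECT scan_timestamp, COUNT(*), MAX(score) FROM scan_results"
        " WHERE filename = ? AND profile = ? AND model = ?"
        " AND scan_timestamp != ''"  # legacy rows are not a named version
        " GROUP BY scan_timestamp ORDER BY scan_timestamp DESC",
        (filename, profile, model),
    ).fetchall()
    return [{"timestamp": ts, "count": n, "max_score": best} for ts, n, best in rows]


# Demo: one legacy row plus two versioned scans.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (filename TEXT, profile TEXT, model TEXT,"
    " start_time REAL, end_time REAL, score REAL,"
    " scan_timestamp TEXT NOT NULL DEFAULT '')"
)
ts1, ts2 = "20260419_143022_123456", "20260419_150000_000001"
conn.executemany(
    "INSERT INTO scan_results VALUES ('a.wav', 'default', 'EAT', ?, ?, ?, ?)",
    [
        (0.0, 1.0, 0.40, ""),
        (0.0, 1.0, 0.60, ts1),
        (2.0, 3.0, 0.95, ts1),
        (0.0, 1.0, 0.80, ts2),
    ],
)
versions = get_scan_versions(conn, "a.wav", "default", "EAT")
```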

get_scan_results(filename, profile, scan_timestamp=None)

- With scan_timestamp: returns the rows matching that exact version.
- Without (default): joins against a subquery that takes MAX(scan_timestamp) per model, so only the latest version of each model is returned. Legacy rows (empty timestamp) sort before any real timestamp, so they are returned only when no versioned scans exist for that model.
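The latest-per-model lookup can be sketched as follows. The selected columns, the `conn` parameter, and the result ordering are assumptions; the INNER JOIN on a MAX(scan_timestamp) subquery follows the description above.

```python
import sqlite3

_COLUMNS = "r.model, r.start_time, r.end_time, r.score, r.scan_timestamp"


def get_scan_results(conn, filename, profile, scan_timestamp=None):
    """Return one exact version, or the latest version of every model."""
    if scan_timestamp is not None:
        return conn.execute(
            f"SELECT {_COLUMNS} FROM scan_results r"
            " WHERE r.filename = ? AND r.profile = ? AND r.scan_timestamp = ?"
            " ORDER BY r.model, r.start_time",
            (filename, profile, scan_timestamp),
        ).fetchall()
    return conn.execute(
        f"SELECT {_COLUMNS} FROM scan_results r"
        " INNER JOIN ("
        "   SELECT model, MAX(scan_timestamp) AS latest FROM scan_results"
        "   WHERE filename = ? AND profile = ? GROUP BY model"
        " ) m ON r.model = m.model AND r.scan_timestamp = m.latest"
        " WHERE r.filename = ? AND r.profile = ?"
        " ORDER BY r.model, r.start_time",
        (filename, profile, filename, profile),
    ).fetchall()


# Demo: EAT has a legacy row and a versioned row, HuBERT only a legacy row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_results (filename TEXT, profile TEXT, model TEXT,"
    " start_time REAL, end_time REAL, score REAL,"
    " scan_timestamp TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO scan_results VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a.wav", "default", "EAT", 0.0, 1.0, 0.40, ""),
        ("a.wav", "default", "EAT", 0.0, 1.0, 0.95, "20260419_143022_123456"),
        ("a.wav", "default", "HuBERT", 2.0, 3.0, 0.70, ""),
    ],
)
latest = get_scan_results(conn, "a.wav", "default")
legacy = get_scan_results(conn, "a.wav", "default", scan_timestamp="")
```

In the demo, EAT resolves to its versioned row while HuBERT falls back to its legacy row, because the empty string is the smallest value MAX can see for that model.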

UI (`main.py` — `ScanResultsPanel`)

Each model tab wraps its QTableWidget in a container QWidget with a QComboBox for version selection:

container (QWidget)
├── cmb_version (QComboBox) — hidden when ≤ 1 version
└── table (QTableWidget)

Helper methods unwrap this container:

_current_table() — returns QTableWidget from active tab (handles both raw table and container)
_tab_table(index) — same by tab index

Version combo is populated by _populate_version_combos() after every load_for_file() and add_scan_results() call. Labels use datetime.strptime parsing with try/except fallback for robustness:

2026-04-19 14:30 (12 regions, best: 0.95)
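A label in that shape could be built as below. The function name `format_version_label` and the two-decimal score format are assumptions inferred from the example label; falling back to the raw string on a parse error is one possible reading of "try/except fallback".

```python
from datetime import datetime


def format_version_label(timestamp, count, max_score):
    """Turn a stored scan_timestamp into a readable combo-box label."""
    try:
        when = datetime.strptime(timestamp, "%Y%m%d_%H%M%S_%f").strftime(
            "%Y-%m-%d %H:%M"
        )
    except ValueError:
        # Unparseable value: show it unchanged rather than fail.
        when = timestamp
    return f"{when} ({count} regions, best: {max_score:.2f})"


label = format_version_label("20260419_143022_123456", 12, 0.95)
fallback = format_version_label("not-a-timestamp", 3, 0.5)
```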

Version switching via _on_version_changed(model, idx):

- Reads scan_timestamp from the combo's userData
- Calls get_scan_results(filename, profile, scan_timestamp=ts)
- Repopulates the table in-place
- Clears the undo stack, because stale undo entries from a different version would corrupt data
- Emits regions_edited to refresh the timeline

Tab switching: the tab_changed signal is connected to _on_scan_regions_edited (not just _update_scan_export_count), so the timeline redraws scan regions when the user switches model tabs.

Cache interaction

The embedding cache is keyed per (file, model) and doesn't change across scans. History stores only classified regions (start, end, score), not embeddings, so keeping several versions does not affect the cache.

Test

tests/test_db.py::test_scan_result_history — saves 3 versions, verifies counts, ordering, and latest-by-default behavior.

3. Hard Negative Management

Schema

Added column to hard_negatives:

source_model TEXT NOT NULL DEFAULT ''

Migration adds the column via ALTER TABLE for existing databases.

DB methods (`core/db.py`)

add_hard_negatives(filename, profile, times, source_path="", source_model="") — now stores which embedding model produced the scan that led to the negative marking.

get_hard_negatives(profile) — returns all rows as [{id, filename, start_time, source_path, source_model}, ...] for the management dialog.

delete_hard_negatives_by_ids(ids) — bulk delete by row IDs.

get_training_data(..., use_hard_negatives=True) — new parameter. When False, the hard negatives query is skipped entirely. Non-destructive — negatives remain in DB.
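The bulk delete and the non-destructive toggle can be sketched like this. `load_negative_times` is a hypothetical stand-in for the negatives portion of get_training_data(), whose full signature is not shown in this document; the table layout in the demo is likewise an assumption based on the fields listed above.

```python
import sqlite3


def delete_hard_negatives_by_ids(conn, ids):
    """Delete rows by primary key in one statement; returns rows removed."""
    ids = list(ids)
    if not ids:
        return 0  # nothing selected, nothing to delete
    placeholders = ",".join("?" for _ in ids)
    cur = conn.execute(
        f"DELETE FROM hard_negatives WHERE id IN ({placeholders})", ids
    )
    return cur.rowcount


def load_negative_times(conn, profile, use_hard_negatives=True):
    """Hypothetical helper: fetch negatives only when the toggle is on."""
    if not use_hard_negatives:
        return []  # query skipped entirely; rows stay in the table
    return conn.execute(
        "SELECT filename, start_time FROM hard_negatives"
        " WHERE profile = ? ORDER BY id",
        (profile,),
    ).fetchall()


# Demo table with three negatives from two source models.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hard_negatives (id INTEGER PRIMARY KEY, filename TEXT,"
    " profile TEXT, start_time REAL, source_path TEXT NOT NULL DEFAULT '',"
    " source_model TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO hard_negatives (filename, profile, start_time, source_model)"
    " VALUES (?, ?, ?, ?)",
    [
        ("a.wav", "default", 1.5, "EAT"),
        ("a.wav", "default", 4.0, "EAT"),
        ("b.wav", "default", 2.0, "HuBERT"),
    ],
)
with_negatives = load_negative_times(conn, "default")
without_negatives = load_negative_times(conn, "default", use_hard_negatives=False)
removed = delete_hard_negatives_by_ids(conn, [1, 3])
remaining = load_negative_times(conn, "default")
```

Turning the toggle off returns an empty list without modifying the table, while the delete call removes exactly the selected IDs.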

Source model tracking (`main.py`)

_on_scan_negatives() now passes source_model=self._scan_panel.current_model_name() when marking negatives from scan results. current_model_name() extracts the model name from the active tab text (stripping the count suffix).
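Stripping the count suffix might be done with a small regular expression. The tab label format `"<model> (<count>)"` is an assumption, since the design does not spell it out, and `model_name_from_tab_text` is a hypothetical free function standing in for the logic inside current_model_name().

```python
import re

# Assumed tab label shape: "<model name> (<count>)", e.g. "EAT_LARGE (12)".
_COUNT_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")


def model_name_from_tab_text(tab_text):
    """Drop a trailing "(N)" count from a tab label, if present."""
    return _COUNT_SUFFIX.sub("", tab_text).strip()


with_count = model_name_from_tab_text("EAT_LARGE (12)")
without_count = model_name_from_tab_text("HuBERT")
```

Anchoring the pattern at the end of the string keeps parentheses that are part of a model name intact, as long as they are not a bare trailing number.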

Training toggle (`main.py` — `TrainDialog`)

Checkbox "Use hard negatives in training" (default checked) with "Manage..." button in an HBox layout. The toggle:

- Updates the live training stats preview via debounced _update_stats()
- Passes use_hard_negatives through _open_train_dialog() to get_training_data()

Management dialog (`main.py` — `HardNegativesDialog`)

Accessible from TrainDialog's "Manage..." button. Features:

| Component | Details |
| --- | --- |
| Filter combo | `(all)` + each distinct `source_model` found in data |
| Summary label | `<b>N</b> hard negatives` |
| Table | File, Time (`{:.1f}s`), Source Model, hidden ID column |
| Delete Selected | Multi-select aware, skips hidden (filtered) rows |
| Clear All | Filter-aware: if a model filter is active, only deletes negatives for that model with an appropriate confirmation message. If `(all)`, deletes everything. |
| Close | Closes dialog, triggers stats refresh in parent TrainDialog |
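The filter-aware Clear All behaviour can be sketched at the database level as below. The function name `clear_hard_negatives`, scoping the delete to a profile, and the table layout are assumptions for illustration; the dialog itself and its confirmation prompt are omitted.

```python
import sqlite3

ALL_MODELS = "(all)"  # label of the unfiltered combo entry


def clear_hard_negatives(conn, profile, model_filter=ALL_MODELS):
    """Delete every negative, or only those of the filtered source model."""
    if model_filter == ALL_MODELS:
        cur = conn.execute(
            "DELETE FROM hard_negatives WHERE profile = ?", (profile,)
        )
    else:
        cur = conn.execute(
            "DELETE FROM hard_negatives WHERE profile = ? AND source_model = ?",
            (profile, model_filter),
        )
    return cur.rowcount


# Demo: two EAT negatives and one HuBERT negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hard_negatives (id INTEGER PRIMARY KEY, filename TEXT,"
    " profile TEXT, start_time REAL, source_model TEXT NOT NULL DEFAULT '')"
)
conn.executemany(
    "INSERT INTO hard_negatives (filename, profile, start_time, source_model)"
    " VALUES (?, ?, ?, ?)",
    [
        ("a.wav", "default", 1.5, "EAT"),
        ("a.wav", "default", 4.0, "EAT"),
        ("b.wav", "default", 2.0, "HuBERT"),
    ],
)
removed_eat = clear_hard_negatives(conn, "default", "EAT")
left_after_filter = conn.execute("SELECT COUNT(*) FROM hard_negatives").fetchone()[0]
removed_rest = clear_hard_negatives(conn, "default")
```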

blockSignals(True) guards prevent spurious filter callbacks during _load() repopulation.

Tests

test_hard_negatives_source_model — verifies source_model stored and retrieved
test_training_data_skips_hard_negatives — verifies use_hard_negatives=False excludes them
test_delete_hard_negatives_by_ids — verifies bulk deletion by ID

4. Runtime Fixes (discovered during testing)

EAT/torchvision ABI mismatch

Problem: torchvision installed from PyPI (CPU build) was incompatible with torch installed from the CUDA wheel index, causing the runtime error "operator torchvision::nms does not exist".

Fix: Added torchvision to the explicit torch install line in both setup scripts:

pip install torch torchaudio torchvision --index-url "$TORCH_INDEX"

Also added --extra-index-url "$TORCH_INDEX" to the pip install -r requirements.txt line to prevent transitive dependencies (timm, ultralytics) from pulling CPU-only torch packages.

Applied to: setup_env.sh (both conda and venv paths), setup-windows.ps1.

EAT / transformers 5.x incompatibility

Problem: transformers 5.x broke EAT's remote model code ('EATModel' object has no attribute 'all_tied_weights_keys').

Fix: Pinned transformers>=4.30,<5.0 in requirements.txt.

NumPy non-writable array warning

Problem: Cached HuBERT/EAT embeddings loaded from disk are read-only numpy arrays. Calling torch.from_numpy() on a non-writable array triggers a warning that PyTorch does not support non-writable tensors.

Fix: In core/audio_scan.py, changed EAT preprocessing to copy the array:

wav = torch.from_numpy(np.array(chunk)).unsqueeze(0).float()

Timeline not updating on tab switch

Problem: Switching model tabs in the scan results panel didn't refresh the timeline's highlighted regions because tab_changed was only connected to _update_scan_export_count.

Fix: Connected tab_changed to _on_scan_regions_edited instead, which handles both timeline refresh and export count update.

File Summary

| File | Changes |
| --- | --- |
| `core/db.py` | Schema migrations, `get_export_folders` filter, versioned `save_scan_results`, `get_scan_versions`, version-aware `get_scan_results`, `add_hard_negatives` with `source_model`, `get_hard_negatives`, `delete_hard_negatives_by_ids`, `get_training_data` with `use_hard_negatives` |
| `main.py` | `HardNegativesDialog` class, `TrainDialog` hard neg toggle + manage button, `ScanResultsPanel` container/combo architecture, version combo population and switching, `current_model_name()`, tab-switch timeline fix |
| `core/audio_scan.py` | `np.array(chunk)` copy for read-only numpy arrays in EAT preprocessing |
| `requirements.txt` | `transformers>=4.30,<5.0` pin |
| `setup_env.sh` | `torchvision` in torch install, `--extra-index-url` on requirements install |
| `setup-windows.ps1` | `torchvision` in torch install, `--extra-index-url` on requirements install, removed skip-if-exists guard |
| `tests/test_db.py` | 5 tests covering all DB-layer changes |
