Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Audio Similarity Scanning — Design
Goal: Scan a video's audio track and highlight segments that match the sound profile of existing reference clips, so the user can quickly find similar moments without scrubbing manually.
Runs in: Python/Qt client (main.py), not the server.
Core Module: core/audio_scan.py
New module alongside core/tracking.py. Two main functions:
- build_profile(clip_paths: list[str]) -> dict: extracts MFCCs (20 coefficients) from each clip using librosa and returns a profile containing both the averaged vector and the individual clip vectors.
- scan_video(video_path: str, profile: dict, mode: str, threshold: float, hop: float) -> list[tuple[float, float, float]]: slides an 8s window across the video's audio and returns (start_time, end_time, score) tuples for segments above threshold.
Feature Extraction
- Audio loaded via librosa.load() (handles video files directly, mono, 22050Hz).
- MFCCs: librosa.feature.mfcc(n_mfcc=20), averaged over the time axis to produce a single vector per window/clip.
- Similarity: cosine similarity (numpy dot product on L2-normalized vectors).
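The extraction and similarity steps above could be sketched roughly as follows. The helper names (l2_normalize, mfcc_vector, load_clip_vector, cosine_similarity) are hypothetical and not taken from core/audio_scan.py, and librosa is imported lazily so the pure-numpy helpers can be used without it.

```python
import numpy as np

SAMPLE_RATE = 22050  # mono sample rate named in the design
N_MFCC = 20          # number of MFCC coefficients named in the design


def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm


def mfcc_vector(samples: np.ndarray, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Average MFCCs over the time axis to get one unit-length 20-dim vector."""
    import librosa  # lazy import: only needed when audio is actually processed

    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=N_MFCC)  # shape (20, frames)
    return l2_normalize(mfcc.mean(axis=1))


def load_clip_vector(path: str) -> np.ndarray:
    """Load a clip as mono 22050 Hz audio and reduce it to a single vector."""
    import librosa

    samples, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    return mfcc_vector(samples, sr)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity as the dot product of L2-normalized vectors."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))
```

Because MFCC means can be negative, the raw cosine value lies in [-1, 1]; the 0.0–1.0 threshold range simply never selects negative scores.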
Matching Modes
- Average mode: Compare each window to the mean of all reference MFCC vectors. Fast, good when references are homogeneous.
- Nearest mode: Compare each window to every reference vector, take the max score. Better when references have variety within the style.
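A minimal sketch of the two modes, assuming the profile is a dict with mean_vector and clip_vectors keys (as in the data flow) and that the mode is passed as the lowercase strings "average" and "nearest". The function names and mode strings are illustrative assumptions, not the module's confirmed API.

```python
import numpy as np


def _unit(v: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis, leaving all-zero rows untouched."""
    n = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(n == 0, 1.0, n)


def make_profile(clip_vectors) -> dict:
    """Build a profile holding the normalized mean and the per-clip vectors."""
    clips = _unit(np.asarray(clip_vectors, dtype=float))
    return {"mean_vector": _unit(clips.mean(axis=0)), "clip_vectors": clips}


def score_window(window_vec, profile: dict, mode: str = "average") -> float:
    """Score one window against the profile using the selected mode."""
    w = _unit(np.asarray(window_vec, dtype=float))
    if mode == "average":
        # One dot product against the mean reference vector.
        return float(profile["mean_vector"] @ w)
    if mode == "nearest":
        # One dot product per reference clip; keep the best match.
        return float(np.max(profile["clip_vectors"] @ w))
    raise ValueError(f"unknown mode: {mode!r}")
```

With two dissimilar references, a window that matches one of them exactly scores 1.0 in nearest mode but lower in average mode, which is the variety case described above.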
Parameters
- threshold (float, 0.0–1.0): minimum cosine similarity to include a segment. Default 0.7.
- hop (float, seconds): step size for the sliding window. Default 1.0s.
- Window size fixed at 8s to match reference clip length.
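How these parameters interact can be illustrated with a small sketch. The names window_starts and filter_segments are hypothetical, and the sketch assumes that only full 8s windows are scored and that a score equal to the threshold is kept.

```python
WINDOW_SECONDS = 8.0  # fixed window size from the design


def window_starts(duration: float, hop: float = 1.0,
                  window: float = WINDOW_SECONDS) -> list[float]:
    """Start times of every full window that fits inside the audio."""
    if hop <= 0:
        raise ValueError("hop must be positive")
    starts = []
    i = 0
    while i * hop + window <= duration:
        starts.append(i * hop)
        i += 1
    return starts


def filter_segments(starts: list[float], scores: list[float],
                    threshold: float = 0.7,
                    window: float = WINDOW_SECONDS) -> list[tuple[float, float, float]]:
    """Keep (start, end, score) for windows scoring at or above the threshold."""
    return [(s, s + window, sc) for s, sc in zip(starts, scores) if sc >= threshold]


starts = window_starts(10.0)                          # [0.0, 1.0, 2.0]
segments = filter_segments(starts, [0.5, 0.7, 0.9])   # [(1.0, 9.0, 0.7), (2.0, 10.0, 0.9)]
```

Note that with hop smaller than the window, adjacent matching segments overlap; whether they are merged before display is not specified here.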
UI Integration in main.py
Controls
Added near the existing tracking checkbox area:
- "Scan" button — triggers audio scan on current video.
- Threshold slider (0.0–1.0, step 0.05) — controls match strictness.
- Mode combobox — "Average" / "Nearest".
- Reference source combobox — "Current Profile" / "Custom Folder" (shows folder picker when "Custom Folder" selected).
Scan Workflow
- User clicks Scan.
- Reference clips collected: either all export output_path values from the current profile (via DB) or all audio/video files in a custom folder.
- Scan runs in a QThread so UI stays responsive.
- On completion, results sent to Timeline widget via signal.
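For the "Custom Folder" source, clip collection might look like the sketch below. The function name collect_folder_clips and the extension list are assumptions for illustration; the design does not say which file types count as audio/video or whether subfolders are searched (this sketch does not recurse).

```python
from pathlib import Path

# Assumed set of media extensions; adjust to whatever the client actually supports.
MEDIA_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a",
                    ".mp4", ".mkv", ".mov", ".webm"}


def collect_folder_clips(folder: str) -> list[str]:
    """Return sorted paths of media files directly inside the folder."""
    root = Path(folder)
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )
```

The resulting list has the same shape as the clip_paths argument of build_profile, so both reference sources can feed the same code path.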
Timeline Display
- New set_scan_regions(regions: list[tuple[float, float, float]]) method on Timeline.
- Drawn as semi-transparent colored rectangles behind existing markers.
- Color intensity proportional to score (brighter = higher match).
- Cleared on file change or re-scan.
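One way to make intensity follow the score is to map it linearly onto an alpha value between the threshold and 1.0. This is only a sketch: score_to_alpha and the 60–200 alpha range are hypothetical choices, not values taken from the Timeline widget.

```python
def score_to_alpha(score: float, threshold: float = 0.7,
                   min_alpha: int = 60, max_alpha: int = 200) -> int:
    """Map a score in [threshold, 1.0] linearly onto [min_alpha, max_alpha]."""
    if threshold >= 1.0:
        return max_alpha
    t = (score - threshold) / (1.0 - threshold)
    t = min(1.0, max(0.0, t))  # clamp scores outside the expected range
    return round(min_alpha + t * (max_alpha - min_alpha))
```

Scaling from the threshold rather than from 0.0 uses the full alpha range for the scores that can actually appear, so a barely-passing region is visibly fainter than a near-perfect one.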
Keyboard Shortcut
- S: jump cursor to the next scan region (similar to M for next marker).
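The lookup behind this shortcut can be sketched as a search over region start times. next_region_start is a hypothetical helper, and returning None past the last region (instead of wrapping around to the first) is an assumption, since that behavior is not specified.

```python
from bisect import bisect_right
from typing import Optional


def next_region_start(regions: list[tuple[float, float, float]],
                      cursor: float) -> Optional[float]:
    """Start time of the first region beginning strictly after the cursor."""
    starts = sorted(start for start, _end, _score in regions)
    i = bisect_right(starts, cursor)
    return starts[i] if i < len(starts) else None
```

Using a strict comparison means that pressing the key repeatedly while sitting on a region start advances to the following region rather than staying in place.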
Data Flow
Reference clips (DB export paths or folder)
|
librosa.load() each -> MFCC vectors (20-dim)
|
Profile: { mean_vector, clip_vectors[] }
|
Current video -> librosa.load() full audio (mono 22050Hz)
|
Sliding 8s window (hop=1s) -> MFCC per window
|
Cosine similarity vs profile -> score per position
|
Threshold filter -> [(start, end, score), ...]
|
Timeline: semi-transparent highlight regions
Performance
- 2-hour video at 22050Hz mono: ~159M samples, roughly 635MB in memory as float32 (the librosa.load() default dtype).
- MFCC extraction + sliding window: ~10-30s.
- QThread keeps UI responsive.
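A back-of-envelope check of the decoded audio buffer size, assuming the whole track is held in memory as one array. audio_buffer_bytes is a hypothetical helper; 4 bytes per sample corresponds to float32, the default dtype of librosa.load(), and 2 bytes to 16-bit PCM.

```python
def audio_buffer_bytes(duration_s: float, sample_rate: int = 22050,
                       bytes_per_sample: int = 4) -> int:
    """Size in bytes of a mono audio buffer of the given duration."""
    return int(duration_s * sample_rate) * bytes_per_sample


two_hours = 2 * 3600
float32_bytes = audio_buffer_bytes(two_hours)                    # 635_040_000
int16_bytes = audio_buffer_bytes(two_hours, bytes_per_sample=2)  # 317_520_000
```

This counts only the sample array itself; temporary buffers created during decoding or resampling come on top of it.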
What This Does NOT Do
- No DB schema changes — scan results are ephemeral (visual only).
- No auto-export — user decides what to cut.
- No server integration — runs entirely in the Python client.
- No GPU/ML model dependency — just librosa + numpy.