Files
8-cut/docs/plans/2026-04-17-audio-scan-design.md
2026-04-17 08:33:25 +02:00

3.6 KiB
Raw Permalink Blame History

Audio Similarity Scanning — Design

Goal: Scan a video's audio track and highlight segments that match the sound profile of existing reference clips, so the user can quickly find similar moments without scrubbing manually.

Runs in: Python/Qt client (main.py), not the server.


Core Module: core/audio_scan.py

New module alongside core/tracking.py. Two main functions:

  • build_profile(clip_paths: list[str]) -> dict — extracts MFCCs (20 coefficients) from each clip using librosa, returns a profile containing both the averaged vector and individual clip vectors.
  • scan_video(video_path: str, profile: dict, mode: str, threshold: float, hop: float) -> list[tuple[float, float, float]] — slides an 8s window across the video's audio, returns (start_time, end_time, score) tuples for segments above threshold.

Feature Extraction

  • Audio loaded via librosa.load() (handles video files directly, mono, 22050Hz).
  • MFCCs: librosa.feature.mfcc(n_mfcc=20), averaged over time axis to produce a single vector per window/clip.
  • Similarity: cosine similarity (numpy dot product on L2-normalized vectors).

Matching Modes

  • Average mode: Compare each window to the mean of all reference MFCC vectors. Fast, good when references are homogeneous.
  • Nearest mode: Compare each window to every reference vector, take the max score. Better when references have variety within the style.

Parameters

  • threshold (float, 0.01.0): minimum cosine similarity to include a segment. Default 0.7.
  • hop (float, seconds): step size for the sliding window. Default 1.0s.
  • Window size fixed at 8s to match reference clip length.

UI Integration in main.py

Controls

Added near the existing tracking checkbox area:

  • "Scan" button — triggers audio scan on current video.
  • Threshold slider (0.01.0, step 0.05) — controls match strictness.
  • Mode combobox — "Average" / "Nearest".
  • Reference source combobox — "Current Profile" / "Custom Folder" (shows folder picker when "Custom Folder" selected).

Scan Workflow

  1. User clicks Scan.
  2. Reference clips collected: either all export output_path values from the current profile (via DB) or all audio/video files in a custom folder.
  3. Scan runs in a QThread so UI stays responsive.
  4. On completion, results sent to Timeline widget via signal.

Timeline Display

  • New set_scan_regions(regions: list[tuple[float, float, float]]) method on Timeline.
  • Drawn as semi-transparent colored rectangles behind existing markers.
  • Color intensity proportional to score (brighter = higher match).
  • Cleared on file change or re-scan.

Keyboard Shortcut

  • S — jump cursor to the next scan region (similar to M for next marker).

Data Flow

Reference clips (DB export paths or folder)
    |
librosa.load() each -> MFCC vectors (20-dim)
    |
Profile: { mean_vector, clip_vectors[] }
    |
Current video -> librosa.load() full audio (mono 22050Hz)
    |
Sliding 8s window (hop=1s) -> MFCC per window
    |
Cosine similarity vs profile -> score per position
    |
Threshold filter -> [(start, end, score), ...]
    |
Timeline: semi-transparent highlight regions

Performance

  • 2-hour video at 22050Hz mono ~ 380MB memory.
  • MFCC extraction + sliding window: ~10-30s.
  • QThread keeps UI responsive.

What This Does NOT Do

  • No DB schema changes — scan results are ephemeral (visual only).
  • No auto-export — user decides what to cut.
  • No server integration — runs entirely in the Python client.
  • No GPU/ML model dependency — just librosa + numpy.