docs: audio pipeline improvements design — multi-layer, AST, EAT, calibration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 13:28:32 +02:00
parent 7855ea62c2
commit c5d613fc5f
1 changed files with 98 additions and 0 deletions
@@ -0,0 +1,98 @@
 # Audio Pipeline Improvements Design
 Date: 2026-04-19
 ## Goal
 Improve audio scan classification accuracy, especially for non-speech sounds (suction, gagging, impacts), through three changes:
 1. Multi-layer feature extraction from existing HuBERT/Wav2Vec2 models
 2. Two new embedding models: AST (AudioSet-supervised) and EAT (self-supervised + AudioSet finetuned)
 3. Calibrated classifier for better threshold behavior
 ## 1. Multi-Layer Feature Extraction
 ### Current behavior
 `model(waveforms)` extracts embeddings from the **last transformer layer only**.
 ### Change
 Use `model.extract_features(waveforms)` (torchaudio API) to get all layer outputs. Select layers at quartile boundaries, mean-pool each over time, concatenate.
 | Model | Layers | Single-layer dim | Multi-layer dim (4 quartiles) |
 |-------|--------|-------------------|-------------------------------|
 | HUBERT_XLARGE | 48 | 1280 | 5120 |
 | HUBERT_LARGE | 24 | 1024 | 4096 |
 | HUBERT_BASE | 12 | 768 | 3072 |
 | WAV2VEC2_BASE | 12 | 768 | 3072 |
 ### Implementation
 - New entries in `_EMBED_MODELS`: `"HUBERT_XLARGE_ML"` -> 5120, etc.
 - `_extract_w2v_windows`: when model name ends with `_ML`, call `extract_features()` instead of `model()`, select quartile layers, concat
 - Cache key: model name includes `_ML` suffix -> separate cache files
 - No change to classifier or training pipeline (HistGBT handles high-dim fine)
 ## 2. AST (Audio Spectrogram Transformer)
 ### What
 `MIT/ast-finetuned-audioset-10-10-0.4593` via HuggingFace `transformers`. 86M params, 768-dim, supervised on AudioSet 527 sound classes.
 ### Integration
 - Load: `ASTModel.from_pretrained()` + `ASTFeatureExtractor`
 - Preprocessing: `ASTFeatureExtractor` handles mel spectrogram from 16kHz raw audio
 - Batching: prepare `input_values` per window, stack into batch, forward through model
 - Multi-layer: `output_hidden_states=True` returns 13 layers; `AST_ML` variant concats quartile layers -> 3072-dim
 - Model cached via `_get_w2v_model()` same lazy-load pattern
 ### Entries
 - `"AST"` -> 768
 - `"AST_ML"` -> 3072
 ## 3. EAT (Efficient Audio Transformer)
 ### What
 `worstchan/EAT-base_epoch30_finetune_AS2M` via HuggingFace with `trust_remote_code=True`. 88M params, 768-dim, self-supervised + AudioSet finetuned.
 ### Integration
 - Load: `AutoModel.from_pretrained(..., trust_remote_code=True)`
 - Preprocessing: manual 128-bin Kaldi fbank mel spectrogram via torchaudio, normalize with EAT constants `(mel - (-4.268)) / (4.569 * 2)`, reshape to `[B, 1, T, 128]`
 - Feature extraction: `model.extract_features(mel)` returns `[B, seq, 768]`; CLS token `[:, 0, :]` for utterance-level, or mean-pool `[:, 1:, :]` for frame-level. Use mean-pool for consistency with other models.
 - Multi-layer: not natively supported, skip for now
 ### Entry
 - `"EAT"` -> 768
 ## 4. Calibrated Classifier
 Wrap `HistGradientBoostingClassifier` in `CalibratedClassifierCV(clf, cv=3, method='isotonic')` after fitting. Gives well-calibrated probabilities -> threshold slider maps more linearly to precision/recall.
 One change in `train_classifier()`, no UI changes needed.
 ## 5. Requirements
 Add to `requirements.txt`:
 ```
 transformers>=4.30
 timm>=0.9
 ```
 Both AST and EAT need `transformers`. EAT additionally needs `timm` (used internally by its custom model code). Both setup scripts (`setup_env.sh`, `setup-windows.ps1`) install from `requirements.txt` so no changes needed there.
 ## Cache Compatibility
 - All new model variants get distinct cache keys via model name in the hash
 - Existing caches for HUBERT_XLARGE, BEATs, etc. remain valid and untouched
 - New models create new `.npz` files in the same `cache/w2v/` directory
 ## UI Changes
 - `_EMBED_MODELS` dict additions appear automatically in Train dialog model dropdown and scan model dropdown
 - No other UI changes needed