Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Audio Pipeline Improvements Design
Date: 2026-04-19
Goal
Improve audio scan classification accuracy, especially for non-speech sounds (suction, gagging, impacts), through three changes:
- Multi-layer feature extraction from existing HuBERT/Wav2Vec2 models
- Two new embedding models: AST (AudioSet-supervised) and EAT (self-supervised + AudioSet finetuned)
- Calibrated classifier for better threshold behavior
1. Multi-Layer Feature Extraction
Current behavior
model(waveforms) extracts embeddings from the last transformer layer only.
Change
Use model.extract_features(waveforms) (torchaudio API), which returns a list with one output tensor per transformer layer (plus optional lengths). Select the layers at quartile boundaries, mean-pool each over time, and concatenate the pooled vectors.
| Model | Layers | Single-layer dim | Multi-layer dim (4 quartiles) |
|---|---|---|---|
| HUBERT_XLARGE | 48 | 1280 | 5120 |
| HUBERT_LARGE | 24 | 1024 | 4096 |
| HUBERT_BASE | 12 | 768 | 3072 |
| WAV2VEC2_BASE | 12 | 768 | 3072 |
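The multi-layer dimensions in the table are four times the single-layer width. A minimal sketch of the layer selection, assuming "quartile boundaries" means the last layer of each quarter of the stack (0-based indices); the design text does not pin this down, and the helper names quartile_layer_indices and multi_layer_dim are hypothetical:

```python
def quartile_layer_indices(num_layers: int, parts: int = 4) -> list[int]:
    """Return 0-based indices of the last layer in each of `parts` equal slices.

    Assumption: "quartile boundaries" means the layers at 1/4, 2/4, 3/4 and
    4/4 of the transformer depth.
    """
    if num_layers < parts:
        raise ValueError("need at least one layer per part")
    return [(num_layers * k) // parts - 1 for k in range(1, parts + 1)]


def multi_layer_dim(single_layer_dim: int, parts: int = 4) -> int:
    """Width of the embedding after concatenating `parts` pooled layers."""
    return single_layer_dim * parts


if __name__ == "__main__":
    for name, layers, dim in [("HUBERT_XLARGE", 48, 1280), ("HUBERT_BASE", 12, 768)]:
        print(name, quartile_layer_indices(layers), multi_layer_dim(dim))
```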
Implementation
- New entries in _EMBED_MODELS: "HUBERT_XLARGE_ML" -> 5120, etc.
- _extract_w2v_windows: when the model name ends with _ML, call extract_features() instead of model(), select quartile layers, concat
- Cache key: model name includes the _ML suffix -> separate cache files
- No change to classifier or training pipeline (HistGBT handles high-dim fine)
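
The _ML branch described above could look roughly like this. NumPy arrays stand in for the per-layer tensors returned by extract_features(), and is_multi_layer and pool_and_concat are hypothetical helper names, not functions from the codebase:

```python
import numpy as np


def is_multi_layer(model_name: str) -> bool:
    """Multi-layer variants are marked by the _ML suffix."""
    return model_name.endswith("_ML")


def pool_and_concat(layer_outputs, layer_indices):
    """Mean-pool the selected layers over time and concatenate them.

    layer_outputs: sequence of arrays shaped [batch, time, dim], one per layer.
    layer_indices: 0-based indices of the layers to keep.
    Returns an array shaped [batch, dim * len(layer_indices)].
    """
    pooled = [layer_outputs[i].mean(axis=1) for i in layer_indices]
    return np.concatenate(pooled, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake outputs for a 12-layer, 768-dim model: 2 windows, 49 frames each.
    layers = [rng.standard_normal((2, 49, 768)) for _ in range(12)]
    if is_multi_layer("HUBERT_BASE_ML"):
        emb = pool_and_concat(layers, [2, 5, 8, 11])
    else:
        emb = layers[-1].mean(axis=1)
    print(emb.shape)  # (2, 3072)
```

With torchaudio, the list of per-layer tensors is the first element of the tuple returned by extract_features(); the second element holds the optional valid lengths.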
2. AST (Audio Spectrogram Transformer)
What
MIT/ast-finetuned-audioset-10-10-0.4593 via HuggingFace transformers. 86M params, 768-dim, supervised on AudioSet 527 sound classes.
Integration
- Load: ASTModel.from_pretrained() + ASTFeatureExtractor
- Preprocessing: ASTFeatureExtractor handles mel spectrogram from 16kHz raw audio
- Batching: prepare input_values per window, stack into batch, forward through model
- Multi-layer: output_hidden_states=True returns 13 layers; AST_ML variant concats quartile layers -> 3072-dim
- Model cached via _get_w2v_model() same lazy-load pattern
Entries
- "AST" -> 768
- "AST_ML" -> 3072
3. EAT (Efficient Audio Transformer)
What
worstchan/EAT-base_epoch30_finetune_AS2M via HuggingFace with trust_remote_code=True. 88M params, 768-dim, self-supervised + AudioSet finetuned.
Integration
- Load: AutoModel.from_pretrained(..., trust_remote_code=True)
- Preprocessing: manual 128-bin Kaldi fbank mel spectrogram via torchaudio, normalize with EAT constants (mel - (-4.268)) / (4.569 * 2), reshape to [B, 1, T, 128]
- Feature extraction: model.extract_features(mel) returns [B, seq, 768]; the CLS token [:, 0, :] is the utterance-level embedding and [:, 1:, :] holds the frame-level features. Mean-pool the frame-level features over time for consistency with other models.
- Multi-layer: not natively supported, skip for now
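
The preprocessing and pooling steps above can be sketched as follows. The normalization constants and tensor shapes come from this document; the function names are hypothetical, NumPy arrays stand in for model outputs, and the fbank settings other than 128 mel bins (16 kHz input, 10 ms frame shift, Hann window, no dither) are assumptions chosen for illustration:

```python
import numpy as np

EAT_MEAN = -4.268
EAT_STD = 4.569


def normalize_eat_mel(mel: np.ndarray) -> np.ndarray:
    """Normalize fbank features [B, T, 128] and reshape to [B, 1, T, 128]."""
    if mel.ndim != 3 or mel.shape[-1] != 128:
        raise ValueError("expected mel shaped [B, T, 128]")
    normed = (mel - EAT_MEAN) / (EAT_STD * 2)
    return normed[:, np.newaxis, :, :]


def pool_eat_features(features: np.ndarray, use_cls: bool = False) -> np.ndarray:
    """Reduce model output [B, seq, 768] to one vector per window.

    Token 0 is the CLS token; tokens 1.. are the frame-level features.
    """
    if use_cls:
        return features[:, 0, :]
    return features[:, 1:, :].mean(axis=1)


def kaldi_fbank(waveform_16k):
    """128-bin Kaldi fbank for a torch tensor shaped [1, num_samples].

    Assumption: all settings except num_mel_bins are illustrative.
    """
    import torchaudio.compliance.kaldi as kaldi  # only needed for real audio

    return kaldi.fbank(
        waveform_16k,
        sample_frequency=16000,
        num_mel_bins=128,
        frame_shift=10,
        window_type="hanning",
        use_energy=False,
        htk_compat=True,
        dither=0.0,
    )  # [T, 128]
```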
Entry
- "EAT" -> 768
4. Calibrated Classifier
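Wrap HistGradientBoostingClassifier in CalibratedClassifierCV(clf, cv=3, method='isotonic') and fit the wrapper in place of the bare classifier. With cv=3 the wrapper refits clones of clf on each fold, so clf does not need to be fitted beforehand. This should give better-calibrated probabilities -> threshold slider maps more linearly to precision/recall.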
One change in train_classifier(), no UI changes needed.
5. Requirements
Add to requirements.txt:
transformers>=4.30
timm>=0.9
Both AST and EAT need transformers. EAT additionally needs timm (used internally by its custom model code). Both setup scripts (setup_env.sh, setup-windows.ps1) install from requirements.txt so no changes needed there.
Cache Compatibility
- All new model variants get distinct cache keys via model name in the hash
- Existing caches for HUBERT_XLARGE, BEATs, etc. remain valid and untouched
- New models create new .npz files in the same cache/w2v/ directory
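
To illustrate why existing caches stay valid, here is a sketch of a cache key that hashes the model name together with the audio path. The function cache_path and its key format are hypothetical; the pipeline's own hashing scheme may differ:

```python
import hashlib
from pathlib import Path


def cache_path(audio_path: str, model_name: str, cache_dir: str = "cache/w2v") -> Path:
    """Build a cache file path whose hash includes the model name."""
    key = f"{audio_path}|{model_name}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.npz"


if __name__ == "__main__":
    # Same audio, different model variants -> different cache files.
    print(cache_path("clip.wav", "HUBERT_XLARGE"))
    print(cache_path("clip.wav", "HUBERT_XLARGE_ML"))
```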
UI Changes
- _EMBED_MODELS dict additions appear automatically in Train dialog model dropdown and scan model dropdown
- No other UI changes needed