Files
8-cut/docs/plans/2026-04-19-audio-pipeline-improvements-design.md
T

3.7 KiB

Audio Pipeline Improvements Design

Date: 2026-04-19

Goal

Improve audio scan classification accuracy, especially for non-speech sounds (suction, gagging, impacts), through three changes:

  1. Multi-layer feature extraction from existing HuBERT/Wav2Vec2 models
  2. Two new embedding models: AST (AudioSet-supervised) and EAT (self-supervised + AudioSet finetuned)
  3. Calibrated classifier for better threshold behavior

1. Multi-Layer Feature Extraction

Current behavior

model(waveforms) extracts embeddings from the last transformer layer only.

Change

Use model.extract_features(waveforms) (torchaudio API) to get all layer outputs. Select layers at quartile boundaries, mean-pool each over time, concatenate.

Model Layers Single-layer dim Multi-layer dim (4 quartiles)
HUBERT_XLARGE 48 1280 5120
HUBERT_LARGE 24 1024 4096
HUBERT_BASE 12 768 3072
WAV2VEC2_BASE 12 768 3072

Implementation

  • New entries in _EMBED_MODELS: "HUBERT_XLARGE_ML" -> 5120, etc.
  • _extract_w2v_windows: when model name ends with _ML, call extract_features() instead of model(), select quartile layers, concat
  • Cache key: model name includes _ML suffix -> separate cache files
  • No change to classifier or training pipeline (HistGBT handles high-dim fine)

2. AST (Audio Spectrogram Transformer)

What

MIT/ast-finetuned-audioset-10-10-0.4593 via HuggingFace transformers. 86M params, 768-dim, supervised on AudioSet 527 sound classes.

Integration

  • Load: ASTModel.from_pretrained() + ASTFeatureExtractor
  • Preprocessing: ASTFeatureExtractor handles mel spectrogram from 16kHz raw audio
  • Batching: prepare input_values per window, stack into batch, forward through model
  • Multi-layer: output_hidden_states=True returns 13 layers; AST_ML variant concats quartile layers -> 3072-dim
  • Model cached via _get_w2v_model() same lazy-load pattern

Entries

  • "AST" -> 768
  • "AST_ML" -> 3072

3. EAT (Efficient Audio Transformer)

What

worstchan/EAT-base_epoch30_finetune_AS2M via HuggingFace with trust_remote_code=True. 88M params, 768-dim, self-supervised + AudioSet finetuned.

Integration

  • Load: AutoModel.from_pretrained(..., trust_remote_code=True)
  • Preprocessing: manual 128-bin Kaldi fbank mel spectrogram via torchaudio, normalize with EAT constants (mel - (-4.268)) / (4.569 * 2), reshape to [B, 1, T, 128]
  • Feature extraction: model.extract_features(mel) returns [B, seq, 768]; CLS token [:, 0, :] for utterance-level, or mean-pool [:, 1:, :] for frame-level. Use mean-pool for consistency with other models.
  • Multi-layer: not natively supported, skip for now

Entry

  • "EAT" -> 768

4. Calibrated Classifier

Wrap HistGradientBoostingClassifier in CalibratedClassifierCV(clf, cv=3, method='isotonic') after fitting. Gives well-calibrated probabilities -> threshold slider maps more linearly to precision/recall.

One change in train_classifier(), no UI changes needed.

5. Requirements

Add to requirements.txt:

transformers>=4.30
timm>=0.9

Both AST and EAT need transformers. EAT additionally needs timm (used internally by its custom model code). Both setup scripts (setup_env.sh, setup-windows.ps1) install from requirements.txt so no changes needed there.

Cache Compatibility

  • All new model variants get distinct cache keys via model name in the hash
  • Existing caches for HUBERT_XLARGE, BEATs, etc. remain valid and untouched
  • New models create new .npz files in the same cache/w2v/ directory

UI Changes

  • _EMBED_MODELS dict additions appear automatically in Train dialog model dropdown and scan model dropdown
  • No other UI changes needed