Ethanfel c5d613fc5f docs: audio pipeline improvements design — multi-layer, AST, EAT, calibration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-19 13:28:32 +02:00


Audio Pipeline Improvements Design

Date: 2026-04-19

Goal

Improve audio scan classification accuracy, especially for non-speech sounds (suction, gagging, impacts), through three changes:

Multi-layer feature extraction from existing HuBERT/Wav2Vec2 models
Two new embedding models: AST (AudioSet-supervised) and EAT (self-supervised + AudioSet finetuned)
Calibrated classifier for better threshold behavior

1. Multi-Layer Feature Extraction

Current behavior

model(waveforms) extracts embeddings from the last transformer layer only.

Change

Use model.extract_features(waveforms) (torchaudio API), which returns a list with one output tensor per transformer layer. Select the four layers at quartile boundaries, mean-pool each over time, and concatenate the pooled vectors.

Model	Layers	Single-layer dim	Multi-layer dim (4 quartiles)
HUBERT_XLARGE	48	1280	5120
HUBERT_LARGE	24	1024	4096
HUBERT_BASE	12	768	3072
WAV2VEC2_BASE	12	768	3072
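
The multi-layer dimensions in the table follow from concatenating four pooled layers. Below is a minimal sketch of one way to pick those layers, assuming "quartile boundaries" means the layers at 25%, 50%, 75% and 100% of the stack depth. The helper name quartile_layer_indices is hypothetical and not taken from the codebase.

```python
def quartile_layer_indices(num_layers, parts=4):
    """Return 0-based indices of the layers that end each quartile of the stack."""
    if num_layers < parts:
        raise ValueError("need at least one layer per quartile")
    return [num_layers * k // parts - 1 for k in range(1, parts + 1)]


# Layer counts and single-layer widths copied from the table above.
MODELS = {
    "HUBERT_XLARGE": (48, 1280),
    "HUBERT_LARGE": (24, 1024),
    "HUBERT_BASE": (12, 768),
    "WAV2VEC2_BASE": (12, 768),
}

for name, (layers, dim) in MODELS.items():
    idx = quartile_layer_indices(layers)
    # Multi-layer dim = single-layer dim times the number of selected layers.
    print(name, idx, dim * len(idx))
```

A different reading of "quartile boundaries" (for example, including the first layer) would change the indices but not the resulting dimensions.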

Implementation

New entries in _EMBED_MODELS: "HUBERT_XLARGE_ML" -> 5120, and likewise for the other models in the table
_extract_w2v_windows: when the model name ends with _ML, call extract_features() instead of model(), select the quartile layers, mean-pool and concatenate
Cache key: the model name includes the _ML suffix -> separate cache files
No change to the classifier or training pipeline (HistGradientBoostingClassifier is expected to handle the higher-dimensional input as-is)
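
The _ML dispatch described above can be sketched as follows. This is an illustration only: it works on NumPy arrays standing in for the per-layer tensors that extract_features() would return, and the names ML_SUFFIX, EMBED_DIMS and pool_windows are hypothetical rather than the actual internals of _extract_w2v_windows.

```python
import numpy as np

ML_SUFFIX = "_ML"

# Hypothetical subset of the registry: model name -> embedding dimension.
EMBED_DIMS = {
    "HUBERT_BASE": 768,
    "HUBERT_BASE_ML": 3072,
}


def pool_windows(layer_outputs, model_name):
    """Turn per-layer outputs into one embedding per window.

    layer_outputs: list of [batch, time, dim] arrays, ordered first to last layer.
    """
    if model_name.endswith(ML_SUFFIX):
        n = len(layer_outputs)
        # Assumed quartile choice: layers at 25%, 50%, 75%, 100% of the depth.
        picked = [layer_outputs[n * k // 4 - 1] for k in range(1, 5)]
    else:
        picked = [layer_outputs[-1]]  # current behavior: last layer only
    # Mean-pool each selected layer over time, then concatenate along features.
    pooled = [layer.mean(axis=1) for layer in picked]
    return np.concatenate(pooled, axis=-1)
```

With torch tensors the same steps would use mean(dim=1) and torch.cat instead of the NumPy calls.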

2. AST (Audio Spectrogram Transformer)

What

MIT/ast-finetuned-audioset-10-10-0.4593 via HuggingFace transformers. 86M params, 768-dim, trained with supervision on the 527 AudioSet sound classes.

Integration

Load: ASTModel.from_pretrained() + ASTFeatureExtractor
Preprocessing: ASTFeatureExtractor handles mel spectrogram from 16kHz raw audio
Batching: prepare input_values per window, stack into batch, forward through model
Multi-layer: output_hidden_states=True returns 13 hidden states (the embedding output plus 12 transformer layers); the AST_ML variant concatenates the quartile layers -> 3072-dim
Model cached via _get_w2v_model(), using the same lazy-load pattern
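
The integration steps above can be sketched as below. This is a rough outline under stated assumptions, not the project's code: the function names are hypothetical, torch and transformers are imported lazily to mirror the lazy-load pattern, and the quartile choice (hidden states 3, 6, 9, 12, skipping the embedding output at index 0) is one plausible reading of the design.

```python
from functools import lru_cache

AST_CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"


def quartile_hidden_state_indices(num_hidden_states):
    """Indices into hidden_states, skipping index 0 (the embedding output)."""
    num_layers = num_hidden_states - 1
    return [num_layers * k // 4 for k in range(1, 5)]


@lru_cache(maxsize=1)
def _load_ast():
    """Load the extractor and model once; heavy imports happen on first use."""
    from transformers import ASTFeatureExtractor, ASTModel

    extractor = ASTFeatureExtractor.from_pretrained(AST_CHECKPOINT)
    model = ASTModel.from_pretrained(AST_CHECKPOINT).eval()
    return extractor, model


def embed_ast(windows, multilayer=False):
    """windows: list of 1-D float arrays sampled at 16 kHz, one per window."""
    import torch

    extractor, model = _load_ast()
    # The extractor builds the mel spectrogram and stacks the windows into a batch.
    inputs = extractor(windows, sampling_rate=16000, return_tensors="pt")
    with torch.inference_mode():
        out = model(**inputs, output_hidden_states=multilayer)
    if not multilayer:
        return out.last_hidden_state.mean(dim=1)  # [B, 768]
    idx = quartile_hidden_state_indices(len(out.hidden_states))
    pooled = [out.hidden_states[i].mean(dim=1) for i in idx]
    return torch.cat(pooled, dim=-1)  # [B, 3072]
```

Mean-pooling here runs over every token in the sequence, including the special tokens at the start; whether to exclude them is left open.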

Entries

"AST" -> 768
"AST_ML" -> 3072

3. EAT (Efficient Audio Transformer)

What

worstchan/EAT-base_epoch30_finetune_AS2M via HuggingFace with trust_remote_code=True. 88M params, 768-dim, self-supervised pretraining followed by AudioSet fine-tuning.

Integration

Load: AutoModel.from_pretrained(..., trust_remote_code=True)
Preprocessing: manual 128-bin Kaldi fbank mel spectrogram via torchaudio, normalize with EAT constants (mel - (-4.268)) / (4.569 * 2), reshape to [B, 1, T, 128]
Feature extraction: model.extract_features(mel) returns [B, seq, 768]; the CLS token [:, 0, :] is the utterance-level embedding, and [:, 1:, :] holds the frame-level features. Mean-pool the frame-level features for consistency with the other models.
Multi-layer: not natively supported, skip for now
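
The normalization, reshape and pooling steps above can be sketched with NumPy as follows. The fbank computation and the model call are omitted, arrays stand in for tensors, and the function names are hypothetical.

```python
import numpy as np

# EAT normalization constants quoted in the preprocessing step above.
EAT_NORM_MEAN = -4.268
EAT_NORM_STD = 4.569


def normalize_fbank(mel):
    """mel: [T, 128] fbank for one window -> normalized [1, 1, T, 128]."""
    mel = (mel - EAT_NORM_MEAN) / (EAT_NORM_STD * 2)
    # Add batch and channel axes to reach the [B, 1, T, 128] layout.
    return mel[np.newaxis, np.newaxis, :, :]


def pool_eat_features(features):
    """features: [B, seq, 768] with the CLS token at position 0.

    Drops the CLS token and mean-pools the remaining frame-level features.
    """
    return features[:, 1:, :].mean(axis=1)
```

For a batch of several windows, the per-window results of normalize_fbank can be joined along the first axis, provided every window has the same number of frames T.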

Entry

"EAT" -> 768

4. Calibrated Classifier

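Wrap HistGradientBoostingClassifier in CalibratedClassifierCV(clf, cv=3, method='isotonic') and fit the wrapper. With cv=3 the wrapper clones clf and fits one copy per fold itself, so it replaces the existing fit call rather than running after it. This gives better-calibrated probabilities -> the threshold slider maps more linearly to precision/recall.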

One change in train_classifier(), no UI changes needed.

5. Requirements

Add to requirements.txt:

transformers>=4.30
timm>=0.9

Both AST and EAT need transformers. EAT additionally needs timm (used internally by its custom model code). Both setup scripts (setup_env.sh, setup-windows.ps1) install from requirements.txt so no changes needed there.

Cache Compatibility

All new model variants get distinct cache keys via model name in the hash
Existing caches for HUBERT_XLARGE, BEATs, etc. remain valid and untouched
New models create new .npz files in the same cache/w2v/ directory
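
One way such a cache key could be built is sketched below. The exact inputs to the real hash are not stated in this document, so the key format, the hash function and the name cache_path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache") / "w2v"


def cache_path(audio_path, model_name):
    """Build a cache file path whose hash covers the audio file and the model name."""
    key = f"{model_name}|{audio_path}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    return CACHE_DIR / f"{digest}.npz"
```

Because the model name is part of the hashed key, "HUBERT_XLARGE" and "HUBERT_XLARGE_ML" map to different files for the same audio, which is what keeps existing caches untouched.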

UI Changes

_EMBED_MODELS dict additions appear automatically in Train dialog model dropdown and scan model dropdown
No other UI changes needed
