Mean-only vectors were too similar across different audio segments,
causing everything to match even at threshold 0.99. Adding std
captures temporal dynamics and makes the similarity scores much
more spread out.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>