Papers
arxiv:2309.15938

Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation

Published on Sep 27, 2023
Authors:
,
,

Abstract

In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods to randomly swap the order of the microphones and mask Mel and GCC channels. By using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We also perform a comprehensive analysis of the effect of each augmentation method and a comparison of the fine-tuning performance using different amounts of labeled data.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2309.15938 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2309.15938 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2309.15938 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.