CLAP: LEARNING AUDIO CONCEPTS FROM NATURAL LANGUAGE SUPERVISION
Abstract
Traditional methods in audio analytics typically rely on supervised learning paradigms, where models are
trained with a single class label assigned to numerous audio recordings, limiting their adaptability and requiring
extensive labeled data. In contrast, we propose an innovative approach termed Contrastive Language-Audio
Pretraining (CLAP). CLAP uses natural language supervision to learn audio concepts, employing dual audio and text encoders trained with a contrastive objective that maps recordings and their textual descriptions into a joint multimodal space. We trained CLAP on 128,000 audio-text pairs and evaluated it on 16 diverse downstream tasks spanning domains such as Sound Event Classification, Music analysis, and Speech-related applications. Despite using far fewer training pairs than analogous computer vision models [1], CLAP achieves state-of-the-art Zero-Shot performance, and in supervised setups it establishes new state-of-the-art results on 5 of these tasks. CLAP's Zero-Shot capability thus removes the need for exhaustive class labeling during training and enables flexible, generalizable class prediction across applications.
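To make the dual-encoder contrastive setup described above concrete, the sketch below shows a CLIP-style symmetric cross-entropy loss over a batch of paired audio and text embeddings. This is a minimal illustration under assumed conventions (L2-normalized projections, a learnable temperature `logit_scale`, and the hypothetical function name `clap_contrastive_loss`), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_embed, text_embed, logit_scale):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    audio_embed, text_embed: (batch, dim) projections from the two encoders
    logit_scale: scalar temperature (learnable in CLIP-style training)
    """
    # L2-normalize so the dot product is a cosine similarity
    audio_embed = F.normalize(audio_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = logit_scale * audio_embed @ text_embed.t()

    targets = torch.arange(audio_embed.size(0), device=audio_embed.device)
    loss_audio = F.cross_entropy(logits, targets)      # audio -> text direction
    loss_text = F.cross_entropy(logits.t(), targets)   # text -> audio direction
    return 0.5 * (loss_audio + loss_text)
```

At inference, Zero-Shot classification follows the same pattern: embed the audio clip and a text prompt for each candidate class, then predict the class whose text embedding has the highest cosine similarity with the audio embedding.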