CLAP: LEARNING AUDIO CONCEPTS FROM NATURAL LANGUAGE SUPERVISION

Authors

  • Mohammed Inam Ur Rahman, Mohd Uzair Arfani, Mohammed Zain Ulhaq Ansari (B.E. Students, Department of IT, Lords Institute of Engineering and Technology, Hyderabad)
  • Dr. K. Nagi Reddy (Professor, Department of IT, Lords Institute of Engineering and Technology, Hyderabad)

Abstract

Traditional methods in audio analytics typically rely on supervised learning, in which models are
trained with a single class label assigned to numerous audio recordings, limiting their adaptability and requiring
extensive labeled data. In contrast, we propose an approach termed Contrastive Language-Audio
Pretraining (CLAP). This method leverages natural language supervision to learn audio concepts, employing
dual encoders and contrastive learning to bring textual descriptions and audio signals into a joint multimodal
space. We trained CLAP on a dataset of 128,000 paired audio-text samples and evaluated it across 16
diverse downstream tasks spanning domains such as Sound Event Classification, Music analysis, and Speech-related
applications. Despite using fewer training pairs than analogous computer vision models [1], CLAP achieves
state-of-the-art performance in Zero-Shot scenarios. Furthermore, in supervised learning setups, it sets new
benchmarks on 5 of these tasks. CLAP's Zero-Shot capability thus eliminates the need for exhaustive
class labeling during training, enabling flexible and generalized class prediction across a variety of applications.
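The dual-encoder contrastive objective described in the abstract can be sketched as a symmetric cross-entropy over the pairwise audio-text similarity matrix of a batch, with matching pairs on the diagonal. The following is a minimal NumPy sketch under stated assumptions: the function name, batch shapes, and temperature value are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired
    audio/text embeddings (each of shape N x D). Illustrative only."""
    # L2-normalize each embedding so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # N x N logits: similarity of every audio clip to every caption
    logits = a @ t.T / temperature
    n = logits.shape[0]

    def xent_diagonal(l):
        # Cross-entropy where the correct "class" for row i is column i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logprob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logprob[np.arange(n), np.arange(n)].mean()

    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

Because the loss pulls each clip toward its own caption and pushes it away from the other captions in the batch, class names can later be embedded as text at inference time, which is what enables the Zero-Shot prediction described above.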

Published

2024-08-28

How to Cite

CLAP: LEARNING AUDIO CONCEPTS FROM NATURAL LANGUAGE SUPERVISION. (2024). International Journal of Engineering and Science Research, 14(3), 507-515. https://ijesr.org/index.php/ijesr/article/view/946