CLAP: LEARNING AUDIO CONCEPTS FROM NATURAL LANGUAGE SUPERVISION
Abstract
Traditional methods in audio analytics typically rely on supervised learning paradigms, where models are
trained with a single class label assigned to numerous audio recordings, limiting their adaptability and requiring
extensive labeled data. In contrast, we propose an innovative approach termed Contrastive Language-Audio
Pretraining (CLAP). CLAP uses natural language supervision to learn audio concepts, employing dual audio and text encoders trained with a contrastive objective that maps recordings and their textual descriptions into a joint multimodal space. We trained CLAP on 128,000 audio-text pairs and evaluated it on 16 diverse downstream tasks spanning domains such as Sound Event Classification, Music analysis, and Speech-related applications. Despite using far fewer training pairs than analogous computer vision models [1], CLAP achieves state-of-the-art Zero-Shot performance, and in supervised setups it establishes new state-of-the-art results on 5 of these tasks. CLAP's Zero-Shot capability thus removes the need for exhaustive class labeling during training and enables flexible, generalizable class prediction across applications.
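To make the dual-encoder contrastive setup described above concrete, the sketch below shows a CLIP-style symmetric cross-entropy loss over a batch of paired audio and text embeddings. This is a minimal illustration under assumed conventions (L2-normalized projections, a learnable temperature `logit_scale`, and the hypothetical function name `clap_contrastive_loss`), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_embed, text_embed, logit_scale):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    audio_embed, text_embed: (batch, dim) projections from the two encoders
    logit_scale: scalar temperature (learnable in CLIP-style training)
    """
    # L2-normalize so the dot product is a cosine similarity
    audio_embed = F.normalize(audio_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = logit_scale * audio_embed @ text_embed.t()

    targets = torch.arange(audio_embed.size(0), device=audio_embed.device)
    loss_audio = F.cross_entropy(logits, targets)      # audio -> text direction
    loss_text = F.cross_entropy(logits.t(), targets)   # text -> audio direction
    return 0.5 * (loss_audio + loss_text)
```

At inference, Zero-Shot classification follows the same pattern: embed the audio clip and a text prompt for each candidate class, then predict the class whose text embedding has the highest cosine similarity with the audio embedding.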