Utilizing Deep Learning for Audio Detection

Audio classifiers have become increasingly important. They support tasks ranging from identifying human voices to distinguishing between bird calls and musical notes.

Recently, a deep learning audio classifier based on ResNeSt has shown impressive results in sound identification. Developed by Dmytro Karabash, Maxim Korotkov, and Tony Chen, it identified bird species in the Cornell Birdcall Identification Challenge and earned a silver medal (top 2%).

The ResNeSt audio classifier converts raw audio into a log-mel spectrogram, which is passed through a ResNeSt50 backbone. The extracted features contain both spatial and temporal information. They are further processed with RoI pooling and bi-GRU layers, which capture time-wise information and reduce the feature dimension.
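As a rough illustration of the first stage, a log-mel spectrogram can be computed with NumPy alone. This is a minimal sketch, not the team's pipeline: the function names and the parameter values (32 kHz sample rate, 1024-sample FFT, hop of 320 samples, 64 mel bands) are assumptions chosen for the example, and the original work may use different settings.

```python
import numpy as np

def hz_to_mel(hz):
    # Common HTK-style mel formula.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(wave, sr=32000, n_fft=1024, hop=320, n_mels=64):
    """Return a log-mel spectrogram in dB, shape (n_mels, n_frames).

    All default values are illustrative assumptions.
    """
    wave = np.asarray(wave, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    # Clamp before the log so silent bands do not produce -inf.
    return (10.0 * np.log10(np.maximum(mel, 1e-10))).T
```

With these settings, a one-second clip sampled at 32 kHz yields a (64, 97) array. It is this two-dimensional, image-like array that a CNN backbone such as ResNeSt50 consumes.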

The tests were run on Colab to reproduce the reported performance. Data-processing speed is crucial when working with a deep learning model, and moving audio processing to a GPU can make it roughly ten to one hundred times faster. In the benchmark, processing audio on a GPU with the torchlibrosa library was approximately 15 times faster than on a CPU.

To train the model, augmentations such as pitch variation and SpecAugment masking of audio frames were applied. One example of the augmented audio is a mix of Alder Flycatcher and American Avocet sounds.
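Two of these augmentations can be sketched in a few lines of NumPy. The code below is an illustrative assumption rather than the team's implementation: the helper names mask_spectrogram and mix_clips, the maximum mask widths, and the choice to fill masked regions with the spectrogram mean are all hypothetical. Pitch variation is not covered here.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """SpecAugment-style masking: blank one random frequency band and one
    random time band of a (n_mels, n_frames) spectrogram.

    Masked cells are filled with the spectrogram mean (an assumption;
    other fill values such as zero are also used in practice).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=float, copy=True)
    n_mels, n_frames = out.shape
    fill = out.mean()

    # Frequency mask: width in [0, max_freq_width], random start row.
    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0 : f0 + f, :] = fill

    # Time mask: width in [0, max_time_width], random start column.
    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0 : t0 + t] = fill
    return out

def mix_clips(wave_a, wave_b, weight=0.5):
    """Blend two waveforms sample by sample, truncating to the shorter one."""
    wave_a = np.asarray(wave_a, dtype=float)
    wave_b = np.asarray(wave_b, dtype=float)
    n = min(len(wave_a), len(wave_b))
    return weight * wave_a[:n] + (1.0 - weight) * wave_b[:n]
```

A mixed clip, such as the Alder Flycatcher and American Avocet example, would typically carry the labels of both species, so the classifier is trained as a multi-label model. That labeling detail is an assumption here and is not stated in the original write-up.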

The ultimate goal of developing such a deep learning audio classifier is to help identify sounds that may disturb sleep or cause confusion, such as unusual bird calls. More broadly, sound identification matters for human safety.

Interestingly, the deep learning architecture used by team Dragonsong in the Cornell Birdcall Identification Kaggle Challenge also combines CNN, RNN, and attention modules. The competition reflects the growing popularity of deep learning (DL), driven by its accuracy and by improvements in computing hardware such as CPUs and GPUs.

In conclusion, the ResNeSt audio classifier is a significant step forward in audio classification. Its ability to identify birds and to handle audio files that contain few or no bird calls makes it a valuable tool for applications ranging from environmental research to improving sleep quality.