Computational bioacoustics with deep learning: a review and roadmap

Introduction

Survey Methodology

  • (bioacoust* OR ecoacoust* OR vocali* OR "animal calls" OR "passive acoustic monitoring" OR soundscape) AND ("deep learning" OR "convolutional neural network" OR "recurrent neural network") AND (animal OR bird* OR cetacean* OR insect* OR mammal*)

State of the Art and Recent Developments

The standard recipe for bioacoustic deep learning

  • Use one of the well-known CNN architectures (ResNet, VGGish, Inception, MobileNet), perhaps pretrained on AudioSet. (These are conveniently available within the popular DL Python frameworks PyTorch, Keras, TensorFlow.) A minimal code sketch of this whole recipe is given after this list.

  • The input will be spectrogram data, typically divided into audio clips of fixed size such as 1 s or 10 s, which is done so that a ‘batch’ of spectrograms fits easily into GPU memory. The spectrograms may be standard (linear-frequency), or mel spectrograms, or log-frequency spectrograms. The “pixels” in the spectrogram are magnitudes: typically these are log-transformed before use, but might not be, or alternatively transformed by per-channel energy normalisation (PCEN). There is no strong consensus on the ‘best’ spectrogram format—it is likely a simple empirical choice based on the frequency bands of interest in your chosen task and their dynamic ranges.

  • The list of labels to be predicted could concern species, individuals, call types, or something else. It may be a binary (yes/no) classification task, which could be used for detecting the presence (occupancy) of some sound. In many cases a list of species is used: modern DL can scale to many hundreds of species. The system may be configured to predict a more detailed output such as a transcription of multiple sound events; I return to this later.

  • Use data augmentation to artificially make a small bioacoustic training dataset more diverse (noise mixing, time shifting, mixup).

  • Although a standard CNN is common, CRNNs are also relatively popular: these add a recurrent layer (LSTM or GRU) after the convolutional layers, either within a network created from scratch or appended to an off-the-shelf network architecture.

  • Train your network using standard good practice in deep learning (for example: Adam optimiser, dropout, early stopping, and hyperparameter tuning) (Goodfellow, Bengio & Courville, 2016).

  • Following good practice, there should be separate data(sub)sets for training, validation (used for monitoring the progress of training and for selecting hyperparameters), and final testing/evaluation. It is especially beneficial if the testing set represents not just unseen data items but novel conditions, to better estimate the true generalisability of the system (Stowell et al., 2019b). However, it is still common for the training/validation/testing data to be sampled from the same pool of source data.

  • Performance is measured using standard metrics such as accuracy, precision, recall, F-score, and/or area under the curve (AUC or AUROC). Since bioacoustic datasets are usually “unbalanced”, having many more items of one category than another, it is common to account for this—for example by using macro-averaging, calculating performance for each class and then taking the average of those to give equal weight to each class (Mesaros, Heittola & Virtanen, 2016).
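
The following is a minimal sketch of this recipe in PyTorch, not a definitive implementation: log-mel input clips, an off-the-shelf CNN, mixup augmentation, the Adam optimiser, and a macro-averaged F-score. Here `n_classes`, the clip length, and the learning rate are illustrative placeholders, and ImageNet weights stand in for AudioSet pretraining (torchvision does not distribute AudioSet weights; an AudioSet-pretrained VGGish could be substituted).

```python
import numpy as np
import librosa
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import f1_score

def logmel(path, sr=22050, clip_seconds=1.0, n_mels=64):
    """Load one fixed-size audio clip and return a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=clip_seconds)
    y = librosa.util.fix_length(y, size=int(sr * clip_seconds))  # pad/trim
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # log-transform the magnitudes

# Off-the-shelf CNN; ImageNet weights stand in for AudioSet pretraining here.
n_classes = 10  # placeholder: length of your species/call-type label list
model = models.resnet18(weights="IMAGENET1K_V1")
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                        bias=False)  # accept 1-channel spectrogram input
model.fc = nn.Linear(model.fc.in_features, n_classes)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One optimisation step with mixup augmentation.
    x: batch of spectrograms [B, 1, n_mels, frames]; y: integer labels [B].
    """
    lam = float(np.random.beta(0.2, 0.2))
    perm = torch.randperm(x.size(0))
    out = model(lam * x + (1 - lam) * x[perm])  # mixed input batch
    loss = lam * criterion(out, y) + (1 - lam) * criterion(out, y[perm])
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

def macro_f1(x, y):
    """Macro-averaged F-score: per-class scores averaged with equal weight."""
    model.eval()
    with torch.no_grad():
        pred = model(x).argmax(dim=1)
    model.train()
    return f1_score(y.numpy(), pred.numpy(), average="macro")
```

In practice this skeleton would be driven by a DataLoader over the training split, with the validation split used for early stopping and hyperparameter selection as noted above.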

Taxonomic coverage

Neural network architectures

Acoustic features: spectrograms, waveforms, and more

Classification, detection, clustering

  1. The first is detection as binary classification: for a given audio clip, return a binary yes/no decision about whether the signal of interest is detected within it (Stowell et al., 2019a). This output would be described by a statistical ecologist as “occupancy” information (presence/absence). It is simple to implement since binary classification is a fundamental task in DL, and does not require data to be labelled in high-resolution detail. Perhaps for these reasons it is widely used in the surveyed literature (e.g. Mac Aodha et al., 2018; Prince et al., 2019; Kiskin et al., 2021; Bergler et al., 2019b; Himawan et al., 2018; Lostanlen et al., 2019b).

  2. The second is detection as transcription, returning slightly more detail: the start and end times of sound events (Morfi et al., 2019; Morfi et al., 2021b). In the DCASE series of challenges and workshops (Detection and Classification of Acoustic Scenes and Events), the task of transcribing sound events, potentially for multiple classes in parallel, is termed sound event detection (SED), and in the present review I will use that terminology. It has typically been approached by training DL to label each small time step (e.g. a segment of 10 ms or 1 s) as positive or negative, and sequences of positives are afterwards merged into predicted event regions (Kong, Xu & Plumbley, 2017; Madhusudhana et al., 2021; Marchal, Fabianek & Aubry, 2021); a code sketch of this merging step is given after this list.

  3. The third is the form common in image object detection, which consists of estimating multiple bounding boxes indicating object locations within an image. Transferred to spectrogram data, each bounding box would represent time and frequency bounds for an “object” (a sound event). This has not often been used in bioacoustics but may be increasing in interest (Venkatesh, Moffat & Miranda, 2021; Shrestha et al., 2021; Romero-Mujalli et al., 2021; Zsebök et al., 2019; Coffey, Marx & Neumaier, 2019).
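
The merging step mentioned in the second formulation is simple to implement. Below is a minimal sketch, assuming a binary detector that outputs one probability per time step; the function name, threshold, and hop duration are illustrative.

```python
import numpy as np

def frames_to_events(frame_probs, hop_seconds, threshold=0.5):
    """Merge runs of positive frames into (start_time, end_time) events.

    frame_probs : 1-D array of per-frame detector outputs in [0, 1]
    hop_seconds : time step of one frame, e.g. 0.01 for 10 ms frames
    """
    active = (np.asarray(frame_probs) >= threshold).astype(int)
    # Pad with zeros so events touching the clip edges are still closed
    edges = np.diff(np.concatenate(([0], active, [0])))
    starts = np.where(edges == 1)[0]   # frame index where a run begins
    ends = np.where(edges == -1)[0]    # frame index just past the run's end
    return [(s * hop_seconds, e * hop_seconds) for s, e in zip(starts, ends)]

# Example: three consecutive positive frames at 10 ms hop -> one event
print(frames_to_events([0.1, 0.2, 0.9, 0.8, 0.7, 0.3], hop_seconds=0.01))
# [(0.02, 0.05)]
```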

Signal processing using deep learning

Small data: data augmentation, pre-training, embeddings

  • multi-task learning, another form of transfer learning, in which a single network is trained on multiple tasks simultaneously (Morfi & Stowell, 2018; Zeghidour et al., 2021; Cramer et al., 2020);

  • semi-supervised learning, which supplements labelled data with unlabelled data (Zhong et al., 2020b; Bergler et al., 2019a);

  • weakly-supervised learning, which allows for labelling that is imprecise or lacks detail (e.g. lacks start and end time of sound events) (Kong, Xu & Plumbley, 2017; Knight et al., 2017; Morfi & Stowell, 2018; LeBien et al., 2020);

  • self-supervised learning, which uses some aspect of the data itself as a substitute for supervised labelling. For example, Baevski et al. (2020) and Saeed, Grangier & Zeghidour (2021) present different self-supervised learning methods to pretrain a system, for use when large amounts of audio are available but labels are not. In both of these works, the pretraining process optimises a NN to determine, for a given audio recording, which of a set of short audio segments genuinely comes from that recording. This contrastive learning task acts as a substitute for a truly semantic task, and performs well for speech and other audio. Since this is at heart an unsupervised learning approach, with no “guidance” on which aspects of the data are of interest, it remains to be seen how well it performs in bioacoustics, where the key information may be only a small part of the overall energy of the signal;

  • few-shot learning, in which a system is trained across multiple similar tasks, in such a way that for a new unseen task (e.g. a new type of call to be detected) the system can perform well even with only one or very few examples of the new task (Morfi et al., 2021b; Acconcjaioco & Ntalampiras, 2021). A popular method for few-shot learning is to create embeddings using prototypical networks, which involve a customised loss function that aims to create an embedding having good “prototypes” (cluster centroids). Pons, Serrà & Serra (2019) determined this to outperform transfer learning for small-data scenarios, and it is the baseline considered in a recent few-shot learning bioacoustic challenge (Morfi et al., 2021b). A minimal sketch of the prototypical loss is given after this list.
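
Below is a minimal sketch of the prototypical loss just described, following the general recipe of prototypical networks rather than any one cited implementation. It assumes embeddings have already been computed by some encoder network; the function name and episode structure are illustrative.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support, support_labels, query, query_labels):
    """Prototypical-network loss for one few-shot 'episode'.

    support : [n_support, dim] embeddings of the few labelled examples
    query   : [n_query, dim] embeddings to classify against the prototypes
    """
    classes = torch.unique(support_labels)
    # Prototype = centroid of each class's support embeddings
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in classes])
    # Negative squared Euclidean distance to each prototype acts as a logit
    logits = -torch.cdist(query, prototypes).pow(2)
    # Remap each query label to its prototype's row index
    targets = torch.stack([(classes == y).nonzero().squeeze()
                           for y in query_labels])
    return F.cross_entropy(logits, targets)
```

Each training episode samples a handful of support and query examples per class, so the embedding is explicitly optimised to support classification from very few examples of a previously unseen class.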

Generalisation and domain shift

Open-set and novelty

Context and auxiliary information

Perception

On-device deep learning

Workflows and other practicalities

A Roadmap for Bioacoustic Deep Learning

Maturing topics? Architectures and features

Learning without large datasets

Equal representation

Interfaces and visualisation

Under-explored machine learning tasks

Individual ID

Sound event detection and object detection

Spatial acoustics

Useful integration of outputs

  • “We wish to specifically highlight one subtler challenge, however, which we believe is substantially hindering progress: the need for better approaches for dealing with uncertainty in these indirect observations. […] First, machine learning classifiers must be specifically designed to return probabilistic, not binary, estimates of species occurrence in an image or recording. Second, statistical models must be designed to take this probabilistic classifier output as input data, instead of the more usual binary presence–absence data. The standard statistical models that are widely used in ecology and conservation, including generalized linear mixed models, generalized additive models and generalized estimating equations, are not designed for this type of input.” (Kitzes & Schricker, 2019)

Behaviour and multi-agent interactions

Low impact

Conclusions

Supplemental Information

BibTeX database of the literature used.

DOI: 10.7717/peerj.13152/supp-1

Additional Information and Declarations

Competing Interests

Dan Stowell is an Academic Editor for PeerJ.

Author Contributions

Dan Stowell conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

This is a review paper and there is no raw data.

Funding

The author received no funding for this work.
