Script Note for Journal Presentation
Summary: Using deep learning (DL) to identify birds from their sounds. Traditionally this is done by human observers, which is challenging and prone to bias. The authors developed BirdNET, which can classify bird species from their sounds.
6/1 - Why does this paper matter - intro:
Monitoring the status and trends of biodiversity: birds are often the target because they occupy most environments and niches, and serve as a baseline for ecosystem health.
Many bird species also have cultural significance. In short, they are important.
Traditionally, birds are monitored by point counts - domain experts visually and aurally identify birds during 5-10 minute intervals at sampling locations - which can be biased and is difficult to scale, with many logistical constraints.
In contrast, passive acoustic monitoring (PAM) uses autonomous recording units (ARUs) to record the acoustic environment continuously. Data collected this way is more cost-effective, and PAM is increasingly widely used in monitoring; it also enables researchers to revisit the data to conduct additional analyses.
The resulting audio is still challenging to analyze (manually identifying which species are present is infeasible for researchers at this scale), so deep neural network (DNN) solutions are needed.
Prior work: the Bird Detection Challenge and BirdCLEF. Early approaches used mel-frequency cepstral coefficients (MFCCs) fed into SVMs or nearest-neighbor classifiers, with computational constraints when processing large amounts of data. A sketch of that pipeline follows.
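A minimal sketch of that earlier pipeline, assuming librosa and scikit-learn; the file names and labels are hypothetical placeholders, not from the paper:

```python
# Hedged sketch of the pre-CNN approach: MFCC features fed into an SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, sr=22050, n_mfcc=20):
    # Load audio, compute MFCCs, and summarize the variable-length
    # recording as a fixed-size vector (per-coefficient mean and std).
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

files = ["rec_species_a.wav", "rec_species_b.wav"]  # hypothetical paths
labels = ["species_a", "species_b"]
X = np.stack([mfcc_features(f) for f in files])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))
```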
CNNs were introduced to the task around 2016: spectrograms treated as images and classified with CNNs.
The authors present BirdNET, built on this previous success with CNNs and spectrogram data, to classify 984 bird species.
So, this paper presents a way to correctly identify bird species from audio, a large improvement over other methods of its time.
2 - Method:
Sound Data -> Visualization (spectrogram) -> Augmentation -> Model -> Prediction.
Data:
Started with a list of 1,049 species (the most common species).
Collected data from Xeno-canto, a community-curated collection of recordings for these species (~500,000 recordings of over 10,000 bird species, totaling over 7,000 h).
Extra recordings from the Macaulay Library of Natural Sounds (~750,000 audio recordings of more than 10,000 species).
Only high-quality recordings were selected, with a maximum of 500 recordings per species -> 226,078 audio files.
Species with <10 audio recordings were eliminated -> n = 984.
Non-event classes were added so the model learns to ignore non-bird signals (other animals, natural noise, human vocalizations, etc.). 16 classes from the AudioSet data were used to enrich the dataset, combined into other-animal, human, and environmental-noise classes. The final dataset has about 1,000 classes and 3,978 h of recordings, split 80/10/10 into train/validation/test.
ARU soundscape data often contains overlapping vocalizations and low SNR, and requires annotation; the model needs to generalize. Training used open-source focal recordings (a single, clearly audible bird species per recording), so there is a domain shift from focal to soundscape spectrogram data. Soundscape data was therefore also used for evaluation.
Pre-processing:
Input: the spectrogram as a monochrome image. Avian vocal and auditory capabilities should be considered (not simply rescaling arbitrary values to fit the model's input size). High temporal resolution is needed: FFT (fast Fourier transform) window size of 10.7 ms with 25% overlap, so each frame advances 8 ms. Frequencies are restricted to 150 Hz - 15 kHz, which also leaves room for pitch shifts during data augmentation. (A code sketch follows this block.)
The frequency axis is compressed to reduce the input size.
Each spectrogram covers a 3 s chunk of audio, based on the average duration of species' vocalizations; this also allows room for data augmentation.
Segments containing vocalizations are extracted with a simple detector (to find their exact location).
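A minimal sketch of this spectrogram step, assuming a 48 kHz sample rate (so the 10.7 ms window is 512 samples and the 8 ms frame advance is 384 samples); librosa is my choice here, not necessarily the authors' tooling:

```python
import numpy as np
import librosa

def chunk_to_spectrogram(y, sr=48000, n_fft=512, hop=384,
                         fmin=150.0, fmax=15000.0):
    # Magnitude spectrogram: 10.7 ms FFT window, 8 ms frame advance.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    keep = (freqs >= fmin) & (freqs <= fmax)  # restrict to 150 Hz - 15 kHz
    return S[keep]  # shape: (freq_bins, time_frames), a monochrome image

# Hypothetical file path; load a 3 s chunk and convert it.
y, sr = librosa.load("recording.wav", sr=48000, duration=3.0)
spec = chunk_to_spectrogram(y, sr=sr)
```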
Data augmentation (a code sketch follows this list):
Random shifts in frequency and time.
Random partial stretching in time and frequency.
Addition of noise from samples that were rejected in pre-processing.
Each augmentation was applied with a probability of 0.5, with a maximum of 3 augmentations per sample.
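A rough numpy sketch of this augmentation policy; the shift sizes, stretch factor, and noise-mixing weight are assumptions, and stretching is shown on the time axis only for brevity:

```python
import numpy as np

def shift_spec(spec, rng, max_shift=5):
    # Random circular shift along time (axis 1) and frequency (axis 0).
    spec = np.roll(spec, rng.integers(-max_shift, max_shift + 1), axis=1)
    return np.roll(spec, rng.integers(-max_shift, max_shift + 1), axis=0)

def stretch_spec(spec, rng, max_factor=0.1):
    # Random partial stretch of the time axis, cropped/padded back.
    f, t = spec.shape
    new_t = max(1, int(t * (1 + rng.uniform(-max_factor, max_factor))))
    idx = np.linspace(0, t - 1, new_t).astype(int)
    out = spec[:, idx]
    return out[:, :t] if new_t >= t else np.pad(
        out, ((0, 0), (0, t - new_t)), mode="edge")

def add_noise(spec, noise_pool, rng, alpha=0.3):
    # Blend in a spectrogram drawn from the rejected (non-signal) samples.
    return spec + alpha * noise_pool[rng.integers(len(noise_pool))]

def augment(spec, noise_pool, rng, p=0.5, max_augs=3):
    # Each augmentation applied with probability p, at most max_augs total.
    augs = [shift_spec, stretch_spec,
            lambda s, r: add_noise(s, noise_pool, r)]
    applied = 0
    for aug in augs:
        if applied == max_augs:
            break
        if rng.random() < p:
            spec = aug(spec, rng)
            applied += 1
    return spec

rng = np.random.default_rng(0)
spec = np.random.rand(64, 384).astype(np.float32)          # dummy input
noise_pool = [np.random.rand(64, 384).astype(np.float32)]  # dummy noise
aug = augment(spec, noise_pool, rng)
```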
Model:
Based on ResNet. Wide residual networks provide performance similar to deep architectures via improved regularization in residual blocks and scaling the network in width rather than depth. The authors followed that design and chose a network resulting in a sequence of 157 layers, 36 of which are weighted.
Three core components (sketched in code below):
1: A preprocessing block that transforms the original input spectrogram before it is passed through a series of residual stacks.
2: A sequence of residual blocks that extract features, which are eventually passed to the third component.
3: A classification block that outputs the class predictions.
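A compact PyTorch sketch of that three-component layout; this is not the authors' exact architecture (channel counts, kernel sizes, and the number of residual blocks here are assumptions):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)  # identity shortcut

class BirdNetSketch(nn.Module):
    def __init__(self, n_classes=984, width=64):
        super().__init__()
        # 1: preprocessing block - transforms the monochrome spectrogram.
        self.pre = nn.Sequential(
            nn.Conv2d(1, width, 5, stride=2, padding=2),
            nn.BatchNorm2d(width), nn.ReLU())
        # 2: residual stacks - feature extraction.
        self.stacks = nn.Sequential(*[ResBlock(width) for _ in range(4)])
        # 3: classification block - pools features and emits class logits.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, n_classes))

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.head(self.stacks(self.pre(x)))
```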

Trained using 1.5 million spectrograms with a maximum of 3500 samples per class.
Cost-sensitive learning schemes were tried, but none of them improved overall performance.
Used mixup training, randomly combining 3 spectrograms into 1 sample. The authors used the Adam optimizer with a learning rate of 0.001 and a batch size of 32, reducing the learning rate by a factor of 0.5 whenever validation loss stalled. Dropout probabilities were also reduced step-wise by 0.1, starting from an initial 0.5. Early stopping with a cooldown of 3 epochs was applied. Knowledge distillation was used to train a born-again network. (A training-step sketch follows.)
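A hedged sketch of one training step with the stated hyperparameters (Adam at lr 0.001, batch size 32, halve the lr on plateau, mixup over 3 spectrograms); the Dirichlet label-mixing weights and the dummy batch are assumptions, not the authors' exact recipe:

```python
import numpy as np
import torch
import torch.nn.functional as F

rng = np.random.default_rng(0)
model = BirdNetSketch()  # from the model sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)            # lr = 0.001
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5)

def mixup3(x, y):
    # mixup generalized to 3 samples: a random convex combination of the
    # batch with two shuffled copies, applied to inputs and labels alike.
    w = torch.as_tensor(rng.dirichlet([1.0, 1.0, 1.0]), dtype=x.dtype)
    p1, p2 = torch.randperm(len(x)), torch.randperm(len(x))
    return (w[0] * x + w[1] * x[p1] + w[2] * x[p2],
            w[0] * y + w[1] * y[p1] + w[2] * y[p2])

# One step on a dummy batch of 32 spectrograms with multi-hot labels.
x = torch.randn(32, 1, 64, 384)
y = F.one_hot(torch.randint(0, 984, (32,)), 984).float()
x, y = mixup3(x, y)
loss = F.binary_cross_entropy_with_logits(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
sched.step(loss.item())  # in practice, step on validation loss
```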
Inference:
Skipped.
Results: Metrics are both sample-wise mean average precision (mAP) and class-wise mean average precision (cmAP), plus top-1 accuracy, F0.5 (a weighted mean of precision and recall that favors precision), and area under the ROC curve (AUC). (A metric sketch follows the numbers below.)
Focal recordings: mAP 0.791 and cmAP 0.694; top-1 accuracy 0.777 and AUC 0.974. Works great on focal data.
Soundscapes: performance drastically decreased - 0.414 on the 2019 BirdCLEF data, versus the best 2019 challenge score of 0.260.
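The two mAP flavors map directly onto scikit-learn's averaging modes; a quick sketch on toy placeholder arrays (not the paper's data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy multi-label ground truth and scores: 4 samples x 3 classes.
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.1, 0.2], [0.2, 0.8, 0.7],
                    [0.6, 0.3, 0.9], [0.1, 0.7, 0.2]])

mAP = average_precision_score(y_true, y_score, average="samples")  # sample-wise
cmAP = average_precision_score(y_true, y_score, average="macro")   # class-wise
auc = roc_auc_score(y_true, y_score, average="macro")
print(mAP, cmAP, auc)
```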
3.3 - Continuous stream: tests whether seasonal changes in avian diversity can be detected. A full year of audio data was analyzed - processing speed is also quick! Checklists were created (min 7, max 83), focused on 121 species.
Determined the correlation between checklist frequency and the number of detections: r = 0.251, so the model's detections tracked the checklists. (Sketch below.)
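That comparison reduces to a Pearson correlation; a one-line check with scipy on placeholder per-species arrays (not the paper's data):

```python
import numpy as np
from scipy.stats import pearsonr

checklist_freq = np.array([0.10, 0.30, 0.05, 0.50, 0.20])  # hypothetical
num_detections = np.array([12, 40, 3, 55, 18])             # hypothetical

r, p = pearsonr(checklist_freq, num_detections)
print(f"r = {r:.3f}, p = {p:.3f}")
```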
Discussion:
The trained model is capable of replicating patterns generated by human observers. The quality of the soundscapes affected the results.
The recording device also matters -> gaps between acoustic domains remain a challenge despite sophisticated data augmentation.
Species repertoire size did not matter -> but vocally versatile species caused confusion.
Noisy data also led to decreased performance.
The results are competitive given the high number of covered species: F0.5 of 0.414 (soundscapes) and mAP of 0.791 (focal recordings). High temporal resolution of the input improved performance.
Multi-label classification and mixup training increased performance.
Deeper layouts did not perform better.
Wider networks outperformed deeper layouts when computational resources were limited.
Cost-sensitive learning did not improve the score.
Labelling of species.