Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data

Jacob Oram; Katharine M. Banner; Christian Stratton; Andrew Hoegh; Kathryn Irvine

doi:10.1214/25-AOAS2096

Annals of Applied Statistics

Abstract

Classification of massive datasets by machine learning (ML) algorithms is promising for many scientific domains, especially wildlife monitoring programs that rely on passive acoustic surveys for detecting species. However, treating ML-predicted class labels (e.g., species identity) as truth biases inferences about focal parameters within common modeling frameworks. One solution is to model the misclassification process explicitly using human-validated true-class labels for a subset of observations. Validation by experts can present a substantial bottleneck in otherwise efficient workflows that use ML predictions. Bioacoustics practitioners seek guidance on both how much ML-labeled data an expert should validate and how to select it. We derive an alternative model formulation that jointly models human-validated and ML-predicted class labels with an observed-data likelihood (ODL), and we use empirically informed simulations motivated by a real-data application to explore different probability designs for selecting class labels for validation. Simulation results suggest that, with smaller validation sets, the ODL formulation increases computational speed and reduces estimation error compared to a default MCMC data augmentation routine. Our methodology is transferable to applications that treat predictions from classification algorithms as the response variable of interest.
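
To make the idea of an observed-data likelihood concrete, the following is a minimal toy sketch and not the authors' hierarchical model. It assumes a simple setting in which each recording has one latent true class, an ML label, and (for a validated subset) an expert label. The function name `observed_data_log_lik`, the parameters `class_probs` and `confusion`, and all numeric values are hypothetical and chosen only for illustration. For unvalidated records the unknown true class is summed out, rather than being sampled as in a data augmentation scheme.

```python
import math

def observed_data_log_lik(ml_labels, true_labels, class_probs, confusion):
    """Toy observed-data log-likelihood for ML labels with partial validation.

    ml_labels   : ML-assigned class index for each record
    true_labels : expert-validated class index, or None if not validated
    class_probs : class_probs[k] = P(true class = k)          (hypothetical)
    confusion   : confusion[k][j] = P(ML label = j | true = k) (hypothetical)
    """
    total = 0.0
    for ml, true in zip(ml_labels, true_labels):
        if true is None:
            # Unvalidated record: marginalize over every possible true class.
            p = sum(class_probs[k] * confusion[k][ml]
                    for k in range(len(class_probs)))
        else:
            # Validated record: the true class is observed, so no sum is needed.
            p = class_probs[true] * confusion[true][ml]
        total += math.log(p)
    return total

# Illustrative values only: two classes, three records, one of them validated.
class_probs = [0.7, 0.3]
confusion = [[0.9, 0.1],
             [0.2, 0.8]]
ml_labels = [0, 1, 0]
true_labels = [0, None, None]

log_lik = observed_data_log_lik(ml_labels, true_labels, class_probs, confusion)
print(log_lik)
```

Because the latent labels never appear as unknowns in this function, a sampler would only need to update `class_probs` and `confusion`, which is one plausible reason a marginalized formulation can be faster than augmenting the data with imputed true labels.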

Suggested Citation

Oram, J., Banner, K.M., Stratton, C., Hoegh, A., Irvine, K., 2025, Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data: Annals of Applied Statistics, v. 19, no. 4, p. 2957-2980, https://doi.org/10.1214/25-AOAS2096.

Additional publication details

Publication type	Article
Publication Subtype	Journal Article
Title	Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data
Series title	Annals of Applied Statistics
DOI	10.1214/25-AOAS2096
Volume	19
Issue	4
Year Published	2025
Language	English
Publisher	Project Euclid
Contributing office(s)	Northern Rocky Mountain Science Center
Description	24 p.
First page	2957
Last page	2980