Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data

Annals of Applied Statistics

Abstract

Classification of massive datasets by machine learning (ML) algorithms is promising for many scientific domains, especially wildlife monitoring programs that rely on passive acoustic surveys for detecting species. However, treating ML-predicted class labels (e.g., species identity) as truth biases inferences of focal parameters within common modeling frameworks. One solution is to model the misclassification process explicitly using human-validated true-class labels for a subset of observations. Validation by experts can present a substantial bottleneck in otherwise efficient workflows that use ML predictions. Bioacoustics practitioners seek guidance on both the quantity and process for selecting ML-labeled data to validate by an expert. We derive an alternative model formulation that jointly models human-validated and ML-predicted class labels with an observed-data likelihood (ODL) and use empirically informed simulations motivated by a real-data application to explore different probability designs for selecting class labels for validation. Simulation results suggest that with smaller validation sets the ODL formulation increases computational speed and reduces estimation error compared to a default MCMC data augmentation routine. Our methodology is transferable to applications that treat predictions from classification algorithms as the response variable of interest.
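
To illustrate what an observed-data likelihood looks like in this setting, the sketch below uses a deliberately simplified, non-hierarchical misclassification model; it is not the model from the article. All names are hypothetical: `psi` is assumed to hold the probabilities of each true class, and `theta[j, k]` the probability that the ML algorithm assigns label `k` when the true class is `j`. Validated observations contribute the joint probability of the true class and the ML label, while unvalidated observations contribute the marginal probability of the ML label, with the unknown true class summed out rather than sampled as in data augmentation.

```python
import numpy as np


def observed_data_loglik(psi, theta, ml_labels, true_labels):
    """Log-likelihood with unknown true classes summed out (illustrative only).

    psi         : (K,) assumed probabilities of each true class.
    theta       : (K, K) assumed confusion matrix, theta[j, k] = P(ML label k | true class j).
    ml_labels   : (n,) integer ML-predicted labels.
    true_labels : (n,) integer validated labels, with -1 where not validated.
    """
    psi = np.asarray(psi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    ml_labels = np.asarray(ml_labels)
    true_labels = np.asarray(true_labels)
    validated = true_labels >= 0

    # Validated observations: joint probability psi[j] * theta[j, k].
    j = true_labels[validated]
    k = ml_labels[validated]
    ll_validated = np.log(psi[j] * theta[j, k]).sum()

    # Unvalidated observations: marginal probability sum_j psi[j] * theta[j, k].
    marginal = psi @ theta
    ll_unvalidated = np.log(marginal[ml_labels[~validated]]).sum()

    return ll_validated + ll_unvalidated
```

In a Bayesian fit, a function of this kind would be evaluated inside the sampler in place of drawing a latent true label for every unvalidated observation, which is the general mechanism by which marginalization can reduce computation.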

Publication type: Article
Publication subtype: Journal Article
Title: Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data
Series title: Annals of Applied Statistics
DOI: 10.1214/25-AOAS2096
Volume: 19
Issue: 4
Year published: 2025
Language: English
Publisher: Project Euclid
Contributing office(s): Northern Rocky Mountain Science Center
Description: 24 p.
First page: 2957
Last page: 2980