Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data
Links
- More information: Publisher Index Page (via DOI)
- Open Access Version: Publisher Index Page
Abstract
Classification of massive datasets by machine learning (ML) algorithms is promising for many scientific domains, especially wildlife monitoring programs that rely on passive acoustic surveys for detecting species. However, treating ML-predicted class labels (e.g., species identity) as truth biases inferences of focal parameters within common modeling frameworks. One solution is to model the misclassification process explicitly using human-validated true-class labels for a subset of observations. Validation by experts can present a substantial bottleneck in otherwise efficient workflows that use ML predictions. Bioacoustics practitioners seek guidance on both the quantity and process for selecting ML-labeled data to validate by an expert. We derive an alternative model formulation that jointly models human-validated and ML-predicted class labels with an observed-data likelihood (ODL) and use empirically informed simulations motivated by a real-data application to explore different probability designs for selecting class labels for validation. Simulation results suggest that with smaller validation sets the ODL formulation increases computational speed and reduces estimation error compared to a default MCMC data augmentation routine. Our methodology is transferable to applications that treat predictions from classification algorithms as the response variable of interest.
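As a rough illustration of the idea of an observed-data likelihood, the sketch below sums the latent true class out of the likelihood for observations that were not validated, and uses the joint probability of true class and ML label for those that were. It is a generic two-part misclassification likelihood written under simplifying assumptions (independent observations, a single vector of class probabilities, a fixed confusion matrix). It is not the hierarchical model from the article, and the names `observed_data_loglik`, `class_probs`, and `confusion` are hypothetical.

```python
import numpy as np


def observed_data_loglik(ml_labels, true_labels, class_probs, confusion):
    """Log-likelihood with the latent true class summed out where unvalidated.

    Illustrative assumptions (not taken from the article):
    ml_labels   : int array, ML-predicted class of each observation
    true_labels : int array, validated true class, or -1 if not validated
    class_probs : (K,) probability of each true class
    confusion   : (K, K) matrix, confusion[k, j] = P(ML label j | true class k)
    """
    ml_labels = np.asarray(ml_labels)
    true_labels = np.asarray(true_labels)
    class_probs = np.asarray(class_probs, dtype=float)
    confusion = np.asarray(confusion, dtype=float)
    validated = true_labels >= 0

    # Validated observations: joint probability of (true class, ML label).
    k = true_labels[validated]
    j = ml_labels[validated]
    ll_validated = np.log(class_probs[k] * confusion[k, j]).sum()

    # Unvalidated observations: marginal probability of the ML label,
    # obtained by summing over every possible true class.
    marginal = class_probs @ confusion
    ll_unvalidated = np.log(marginal[ml_labels[~validated]]).sum()

    return ll_validated + ll_unvalidated


# Toy inputs, invented for illustration only.
class_probs = np.array([0.6, 0.4])
confusion = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
ml_labels = np.array([0, 1, 0])
true_labels = np.array([0, -1, -1])  # only the first observation is validated
loglik = observed_data_loglik(ml_labels, true_labels, class_probs, confusion)
```

Because the unvalidated true classes never appear as unknowns, a sampler built on a likelihood of this form would not need to impute them at each iteration, which is the general contrast with data augmentation that the abstract describes.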
| Publication type | Article |
|---|---|
| Publication Subtype | Journal Article |
| Title | Leveraging an observed-data likelihood improves the use of machine learning labels in a Bayesian hierarchical model for bioacoustic data |
| Series title | Annals of Applied Statistics |
| DOI | 10.1214/25-AOAS2096 |
| Volume | 19 |
| Issue | 4 |
| Year Published | 2025 |
| Language | English |
| Publisher | Project Euclid |
| Contributing office(s) | Northern Rocky Mountain Science Center |
| Description | 24 p. |
| First page | 2957 |
| Last page | 2980 |