A probabilistic approach to training machine learning models using noisy data
Links
- More information: Publisher Index Page (via DOI)
- Open Access Version: Publisher Index Page
Abstract
Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically contain uncertainties resulting from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) develop an approach capable of detecting noisy data during model deployment, and (3) interpret the reasons why a data point is deemed noisy to help heuristically distinguish between outliers and erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems with varying complexity and a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
Field | Value |
---|---|
Publication type | Article |
Publication Subtype | Journal Article |
Title | A probabilistic approach to training machine learning models using noisy data |
Series title | Environmental Modelling & Software |
DOI | 10.1016/j.envsoft.2024.106133 |
Volume | 179 |
Year Published | 2024 |
Language | English |
Publisher | Elsevier |
Contributing office(s) | California Water Science Center |
Description | 106133, 15 p. |