A probabilistic approach to training machine learning models using noisy data

Environmental Modelling & Software

Abstract

Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but their predictions typically carry uncertainty introduced by noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML with Markov chain Monte Carlo (MCMC) simulation to (1) detect and underweight likely noisy data, (2) detect noisy data during model deployment, and (3) interpret why a data point is deemed noisy, which helps to heuristically distinguish outliers from erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean subsets, and therefore produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems of varying complexity and in a real-world public-supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
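
The idea of sampling an ensemble of plausible noisy/clean splits can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a straight-line fit in place of an ML model, a simple Metropolis sampler over a binary clean/noisy mask, and illustrative values for the noise scales and the prior probability of a point being noisy. All function names and parameters (`log_posterior`, `sample_splits`, `sigma_clean`, `sigma_noisy`, `p_noisy`) are hypothetical.

```python
import numpy as np

def log_posterior(mask, x, y, sigma_clean=0.1, sigma_noisy=5.0, p_noisy=0.1):
    """Log posterior of one clean/noisy split (mask True = clean).

    Assumed toy model: clean points follow a fitted line with a narrow
    Gaussian error; noisy points are explained by a broad Gaussian.
    """
    if mask.sum() < 2:
        return -np.inf  # a line cannot be fitted to fewer than two points
    # Fit the line using only the points currently labelled clean.
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    resid = y - (slope * x + intercept)
    sigma = np.where(mask, sigma_clean, sigma_noisy)
    log_lik = np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma))
    log_prior = np.sum(np.where(mask, np.log(1 - p_noisy), np.log(p_noisy)))
    return log_lik + log_prior

def sample_splits(x, y, n_iter=4000, burn_in=1000, seed=0):
    """Metropolis sampling over splits: flip one label per proposal."""
    rng = np.random.default_rng(seed)
    mask = np.ones(len(x), dtype=bool)  # start with every point labelled clean
    current = log_posterior(mask, x, y)
    samples = []
    for it in range(n_iter):
        proposal = mask.copy()
        i = rng.integers(len(x))
        proposal[i] = ~proposal[i]
        candidate = log_posterior(proposal, x, y)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if np.log(rng.random()) < candidate - current:
            mask, current = proposal, candidate
        if it >= burn_in:
            samples.append(mask.copy())
    return np.array(samples)

# Synthetic data: a line with small noise and three corrupted observations.
data_rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 0.05, size=x.size)
corrupted = [5, 17, 25]
y[corrupted] += 3.0

samples = sample_splits(x, y)
# Fraction of sampled splits in which each point is labelled noisy.
noisy_frequency = 1.0 - samples.mean(axis=0)
print(np.round(noisy_frequency, 2))
```

In this sketch, each retained sample is one plausible split, and `noisy_frequency` summarizes the ensemble as a per-point probability of being noisy. Such a probability could in principle serve as a basis for underweighting suspect points, but how the paper does this is not stated in the abstract.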

Publication type: Article
Publication subtype: Journal Article
Title: A probabilistic approach to training machine learning models using noisy data
Series title: Environmental Modelling & Software
DOI: 10.1016/j.envsoft.2024.106133
Volume: 179
Year published: 2024
Language: English
Publisher: Elsevier
Contributing office(s): California Water Science Center
Description: 106133, 15 p.