A probabilistic approach to training machine learning models using noisy data

Ayman H. Alzraiee; Richard G. Niswonger

doi:10.1016/j.envsoft.2024.106133

Environmental Modelling & Software

Abstract

Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically contain uncertainties resulting from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) develop an approach capable of detecting noisy data during model deployment, and (3) interpret the reasons why a data point is deemed noisy to help heuristically distinguish between outliers and erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems with varying complexity and a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
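
The idea of sampling an ensemble of plausible clean/noisy splits can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give its details, so everything below is an assumption made for illustration. A linear least-squares fit stands in for the ML model, each point is scored with a two-component Gaussian likelihood (narrow for clean, wide for noisy), and a single-flip Metropolis sampler explores the splits. The names log_posterior and sample_splits and the parameters sigma_clean, sigma_noisy, and prior_noisy are hypothetical.

```python
import numpy as np


def log_posterior(mask, X, y, sigma_clean=0.5, sigma_noisy=5.0, prior_noisy=0.1):
    """Log posterior of one split; mask[i] is True when point i is treated as clean.

    Assumed toy model, not taken from the paper.
    """
    if mask.sum() < X.shape[1] + 1:
        return -np.inf  # too few clean points to fit the stand-in model
    # Fit the stand-in model (ordinary least squares) on the clean points only.
    coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    resid = y - X @ coef
    # Clean points are scored with a narrow Gaussian, noisy points with a wide one.
    sigma = np.where(mask, sigma_clean, sigma_noisy)
    log_like = -0.5 * (resid / sigma) ** 2 - np.log(sigma)
    log_prior = np.where(mask, np.log1p(-prior_noisy), np.log(prior_noisy))
    return float(np.sum(log_like + log_prior))


def sample_splits(X, y, n_iter=4000, burn_in=1000, seed=0):
    """Metropolis sampler over clean/noisy labels; returns the retained masks."""
    rng = np.random.default_rng(seed)
    mask = np.ones(len(y), dtype=bool)  # start with every point labelled clean
    current = log_posterior(mask, X, y)
    ensemble = []
    for step in range(n_iter):
        proposal = mask.copy()
        i = rng.integers(len(y))
        proposal[i] = ~proposal[i]  # symmetric proposal: flip one label
        candidate = log_posterior(proposal, X, y)
        if np.log(rng.random()) < candidate - current:
            mask, current = proposal, candidate
        if step >= burn_in:
            ensemble.append(mask.copy())
    return np.array(ensemble)


# Synthetic demonstration: a straight line with five deliberately corrupted points.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 60)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)
bad = rng.choice(x.size, size=5, replace=False)
y[bad] += 15.0  # inject gross errors

ensemble = sample_splits(X, y)
# Fraction of sampled splits in which each point was labelled noisy.
noisy_prob = 1.0 - ensemble.mean(axis=0)
print("corrupted indices:", np.sort(bad))
print("noisy probability at corrupted indices:", noisy_prob[np.sort(bad)])
```

Averaging the sampled masks gives each point a probability of being noisy rather than a single hard label, which is one way to reflect that the split is not unique. Those probabilities could then be used as training weights, although the weighting scheme used in the paper is not stated in the abstract.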

Additional publication details
Publication type	Article
Publication Subtype	Journal Article
Title	A probabilistic approach to training machine learning models using noisy data
Series title	Environmental Modelling & Software
DOI	10.1016/j.envsoft.2024.106133
Volume	179
Year Published	2024
Language	English
Publisher	Elsevier
Contributing office(s)	California Water Science Center
Description	106133, 15 p.