An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data

Yingxin Gu; Bruce K. Wylie; Stephen P. Boyte; Joshua J. Picotte; Danny Howard; Kelcy Smith; Kurtis Nelson

doi:10.3390/rs8110943

Remote Sensing

Abstract

Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sample data usage strategy for any dataset and to identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, values from 0 to 200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MAD_training = 2.5 and MAD_testing = 2.4) and was therefore suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote sensing-based ecosystem modeling.
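
The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure. Several things here are assumptions made only for illustration: the data are synthetic (one predictor, a response on a 0 to 200 scale), the model is a simple binned-mean predictor standing in for a rule-constrained regression tree (`n_rules` is the number of bins), and the names `make_synthetic_samples`, `fit_binned_model`, `evaluate`, and `best_option` are hypothetical. The sketch repeats a random training/testing split for each combination of training fraction and rule count, then reports the mean MAD and its spread across replications.

```python
import math
import random
import statistics


def mad(predicted, actual):
    """Mean absolute difference between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def make_synthetic_samples(n=500, seed=1):
    """Synthetic (x, y) pairs with y on a 0-200 scale (illustrative data only)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.random()
        y = 100 + 60 * math.sin(3 * x) + rng.gauss(0, 5)
        samples.append((x, min(max(y, 0.0), 200.0)))
    return samples


def fit_binned_model(xs, ys, n_rules):
    """Stand-in for a rule-constrained regression tree (assumption, not the
    study's model): n_rules equal-width bins, each predicting its training mean."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_rules or 1.0
    overall = statistics.fmean(ys)
    sums, counts = [0.0] * n_rules, [0] * n_rules

    def bin_of(x):
        # Clamp so values outside the training range fall in the end bins.
        return min(max(int((x - lo) / width), 0), n_rules - 1)

    for x, y in zip(xs, ys):
        sums[bin_of(x)] += y
        counts[bin_of(x)] += 1
    # Empty bins fall back to the overall training mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]


def evaluate(samples, train_fractions, rule_counts, n_reps=20, seed=0):
    """Mean training/testing MAD and testing-MAD spread for each option.

    Each train fraction must leave at least one testing sample, and
    n_reps must be at least 2 so a standard deviation can be computed.
    """
    rng = random.Random(seed)
    results = {}
    for frac in train_fractions:
        for n_rules in rule_counts:
            train_mads, test_mads = [], []
            for _ in range(n_reps):
                shuffled = samples[:]
                rng.shuffle(shuffled)  # new random split per replication
                cut = int(len(shuffled) * frac)
                train, test = shuffled[:cut], shuffled[cut:]
                model = fit_binned_model(
                    [x for x, _ in train], [y for _, y in train], n_rules
                )
                train_mads.append(
                    mad([model(x) for x, _ in train], [y for _, y in train])
                )
                test_mads.append(
                    mad([model(x) for x, _ in test], [y for _, y in test])
                )
            results[(frac, n_rules)] = {
                "mad_train": statistics.fmean(train_mads),
                "mad_test": statistics.fmean(test_mads),
                "sd_test": statistics.stdev(test_mads),
            }
    return results


def best_option(results):
    """Option with the lowest mean testing MAD (one simple criterion, assumed here)."""
    return min(results, key=lambda key: results[key]["mad_test"])


if __name__ == "__main__":
    results = evaluate(
        make_synthetic_samples(), [0.5, 0.6, 0.7, 0.8, 0.9], [1, 2, 4, 6, 8]
    )
    for (frac, n_rules), r in sorted(results.items()):
        print(f"train={frac:.0%} rules={n_rules}: "
              f"MAD_training={r['mad_train']:.2f} "
              f"MAD_testing={r['mad_test']:.2f} (sd {r['sd_test']:.2f})")
    print("lowest mean testing MAD:", best_option(results))
```

In this sketch, a training MAD well below the testing MAD would point toward overfitting, while large values for both would point toward underfitting. The numbers it prints come from synthetic data and say nothing about the study's results.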

Suggested Citation

Gu, Y., Wylie, B.K., Boyte, S.P., Picotte, J.J., Howard, D., Smith, K., and Nelson, K., 2016, An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data: Remote Sensing, v. 8, p. 1-13, https://doi.org/10.3390/rs8110943.

Additional publication details
Publication type	Article
Publication Subtype	Journal Article
Title	An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data
Series title	Remote Sensing
DOI	10.3390/rs8110943
Volume	8
Publication Date	November 11, 2016
Year Published	2016
Language	English
Publisher	MDPI
Contributing office(s)	Earth Resources Observation and Science (EROS) Center
Description	Article 943; 13 p.
First page	1
Last page	13