Open Journal of Statistics

Volume 1, Issue 3 (October 2011)

ISSN Print: 2161-718X   ISSN Online: 2161-7198

Bias of the Random Forest Out-of-Bag (OOB) Error for Certain Input Parameters

PP. 205-211
DOI: 10.4236/ojs.2011.13024

ABSTRACT

Random Forest is an excellent classification tool, especially in the –omics sciences such as metabolomics, where the number of variables is much greater than the number of subjects, i.e., “n << p.” However, the choice of arguments passed to the random forest implementation is very important. Simulation studies are performed to compare the effects of the input parameters on the predictive ability of the random forest. The number of variables sampled at each split, m-try, has the largest impact on the true prediction error. It is often claimed that the out-of-bag (OOB) error is an unbiased estimate of the true prediction error. However, when n << p and the default arguments are used, the OOB error overestimates the true error, i.e., the random forest actually performs better than the OOB error indicates. This bias is greatly reduced by subsampling without replacement and choosing the same number of observations from each group. Even after these adjustments, a small amount of bias remains. The remaining bias occurs because, among trees with equal predictive ability, the one that performs better on the in-bag samples will perform worse on the out-of-bag samples. Cross-validation can be performed to reduce the remaining bias.
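
The sampling effect described in the abstract can be illustrated with a small toy sketch. It is not the paper's simulation: the "trees" here are a hypothetical stand-in that simply predicts the majority class of their in-bag sample, which is roughly what a tree grown on purely uninformative variables might fall back to. The labels carry no signal, so the true error is 0.5 by construction. The function names, the 0.632 subsample fraction, the class sizes, and the number of trees are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def majority(votes, rng):
    """Return the most common label, breaking ties at random."""
    counts = Counter(votes)
    top = max(counts.values())
    return rng.choice(sorted(c for c, v in counts.items() if v == top))


def oob_error(labels, n_trees, balanced, rng):
    """OOB error of a toy 'forest' whose trees predict the in-bag majority class."""
    n = len(labels)
    by_class = {c: [i for i in range(n) if labels[i] == c]
                for c in sorted(set(labels))}
    votes = [[] for _ in range(n)]  # OOB votes collected per observation
    for _ in range(n_trees):
        if balanced:
            # Subsample WITHOUT replacement, same number from each class.
            # The 0.632 fraction is an illustrative assumption.
            k = int(0.632 * min(len(ix) for ix in by_class.values()))
            in_bag = [i for ix in by_class.values() for i in rng.sample(ix, k)]
        else:
            # Standard bootstrap: n draws WITH replacement.
            in_bag = [rng.randrange(n) for _ in range(n)]
        pred = majority([labels[i] for i in in_bag], rng)
        # Every observation left out of the bag receives this tree's vote.
        for i in set(range(n)) - set(in_bag):
            votes[i].append(pred)
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(majority(votes[i], rng) != labels[i] for i in scored)
    return wrong / len(scored)


rng = random.Random(0)
labels = ["A"] * 20 + ["B"] * 20  # balanced classes, no signal: true error is 0.5
err_bootstrap = oob_error(labels, 500, balanced=False, rng=rng)
err_balanced = oob_error(labels, 500, balanced=True, rng=rng)
print(f"bootstrap OOB error:          {err_bootstrap:.2f}")
print(f"balanced subsample OOB error: {err_balanced:.2f}")
```

In this toy setting the bootstrap estimate should land well above 0.5, because an observation that is out-of-bag leaves its own class slightly under-represented in the bag, so the majority vote tends to go against it. With equal-sized subsamples drawn without replacement the in-bag classes are always balanced, and the estimate should sit close to the true value of 0.5.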

Share and Cite:

M. Mitchell, "Bias of the Random Forest Out-of-Bag (OOB) Error for Certain Input Parameters," Open Journal of Statistics, Vol. 1 No. 3, 2011, pp. 205-211. doi: 10.4236/ojs.2011.13024.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.