TITLE:
Ultra-High Dimensional Feature Selection and Mean Estimation under Missing at Random
AUTHORS:
Wanhui Li, Guangming Deng, Dong Pan
KEYWORDS:
Ultrahigh-Dimensional Data, Missing Data, Sure Independence Screening, Mean Estimation
JOURNAL NAME:
Open Journal of Statistics, Vol.13 No.6, December 18, 2023
ABSTRACT: Next Generation Sequencing (NGS) provides an effective basis for
estimating the survival time of cancer patients, but it also produces data of
very high dimensionality. In addition, some patients drop out of the study,
which leaves their responses missing. A method is therefore needed for
estimating the mean of a response variable with missing values in ultra-high
dimensional datasets. In this paper, we propose a two-stage ultra-high
dimensional variable screening method, RF-SIS, based on random forest
regression, which addresses the difficulty of estimating missing values when
the data dimension is excessive. After dimension reduction with RF-SIS, mean
interpolation is applied to the missing responses; the combined procedure is
denoted RF-SIS-MI. Simulation results show that, compared with the estimation
method that directly deletes missing observations, RF-SIS-MI has significant
advantages in terms of interval coverage proportion, average interval length,
and average absolute deviation.
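
The abstract does not give the details of RF-SIS or of the interpolation step, so the following is only a minimal sketch of the general screen-then-impute idea under missing at random, not the authors' procedure. It swaps in two simpler techniques and says so plainly: features are ranked by absolute marginal Pearson correlation (the classical SIS ranking) instead of random forest regression, and missing responses are filled with a single-feature least-squares fit. The simulated data, the sample sizes, the coefficients, and the function names `screen` and `impute_mean` are all assumptions made for illustration.

```python
import math
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data; every size and coefficient here is an assumption.
n, p = 300, 500  # more features than observations
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 + 3.0 * row[0] + rng.gauss(0, 0.5) for row in X]  # true mean of y is 2

# Missing at random: the chance of observing y depends only on the observed x0.
observed = [rng.random() < 1 / (1 + math.exp(-(0.5 + 1.5 * row[0]))) for row in X]


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def screen(X, y, observed, keep):
    """Stand-in for the screening stage: rank features by |correlation| with y
    on complete cases and return the indices of the top `keep` features."""
    rows = [i for i, seen in enumerate(observed) if seen]
    y_obs = [y[i] for i in rows]
    scores = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in rows]
        scores.append((abs(pearson(col, y_obs)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:keep]]


def impute_mean(X, y, observed, feature):
    """Fit y = a + b * x_feature on complete cases, fill each missing y with
    its fitted value, and return the mean of the completed response."""
    rows = [i for i, seen in enumerate(observed) if seen]
    xs = [X[i][feature] for i in rows]
    ys = [y[i] for i in rows]
    mx, my = fmean(xs), fmean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    a = my - b * mx
    filled = [y[i] if observed[i] else a + b * X[i][feature] for i in range(len(y))]
    return fmean(filled)


selected = screen(X, y, observed, keep=5)
cc_mean = fmean([y[i] for i in range(n) if observed[i]])  # delete missing rows
imp_mean = impute_mean(X, y, observed, selected[0])       # screen, then impute

print("selected features:", selected)
print("complete-case mean:", round(cc_mean, 3))
print("screen-then-impute mean:", round(imp_mean, 3))
```

In this setup the responses are more likely to be observed when the first feature is large, so deleting incomplete rows tends to overstate the mean, while imputing from the screened feature uses the covariate values of the dropped rows. A ranking based on random forest importance, as the paper proposes, could in principle also pick up nonlinear effects that a marginal correlation would miss.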