Guanghua Forum: Public Figures and Entrepreneurs Forum, Session 6184
Topic: Optimal and Safe Estimation for High-Dimensional Semi-Supervised Learning
Streaming platform and meeting ID: Tencent Meeting, ID: 281-493-147
Organizers: Center of Statistical Research and School of Statistics; Office of Scientific Research
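Speaker: Yang Ning received a Ph.D. from the Department of Biostatistics at Johns Hopkins University, followed by postdoctoral training under Grace Yi and Han Liu, and is now an assistant professor in the Department of Statistics and Data Science at Cornell University. Main research interests include high-dimensional statistics, causal inference, and semiparametric models.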
Abstract: In many scenarios, such as electronic health records, the outcome is much more difficult to collect than the covariates. Data with observed outcomes are called labeled, and data without outcomes are called unlabeled. In this paper, we consider the linear regression problem with this data structure in the high-dimensional setting. Clearly, any supervised estimator can use only the labeled data. Our goal is to investigate when and how the unlabeled data can be exploited to improve estimation of the regression parameters of the linear model, given that such linear models may be misspecified in data analysis. To address this problem, we first establish the minimax lower bound for parameter estimation in the semi-supervised setting. We show that the upper bound for supervised estimators, which use only the labeled data, cannot attain this minimax lower bound; thus there is a gap between the two bounds. When the unknown conditional mean function is correctly specified, we close this gap by proposing a new optimal semi-supervised estimator that attains the lower bound and therefore improves on the rate of the supervised estimators. However, the proposed estimator may suffer from a slower rate than the supervised estimators if the conditional mean function is misspecified. To fix this problem, we improve our semi-supervised estimator via a two-step procedure. The resulting estimator, called the safe semi-supervised estimator, remains minimax rate-optimal when the conditional mean function is correctly specified, and is no worse than the supervised estimators even if the conditional mean function is misspecified. Furthermore, we extend our idea to aggregate multiple semi-supervised estimators arising from different misspecifications of the conditional mean function. Extensive numerical simulations and a real data analysis are conducted to illustrate our theoretical results.