
Dr. Yang Ning, Cornell University: Optimal and Safe Estimation for High-Dimensional Semi-Supervised Learning

Guanghua Lecture Series: Forum of Public Figures and Entrepreneurs, No. 6184

Optimal and Safe Estimation for High-Dimensional Semi-Supervised Learning

Speaker: Dr. Yang Ning, Cornell University

Host: Professor Huazhen Lin, School of Statistics

Time: Wednesday, June 22, 2022, 9:30-10:30 a.m.

Live platform and meeting ID: Tencent Meeting, ID: 281-493-147

Organizers: Center of Statistical Research, School of Statistics, and Office of Scientific Research

About the speaker:

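Yang Ning received a Ph.D. from the Department of Biostatistics at Johns Hopkins University, then completed postdoctoral training under Grace Yi and Han Liu, and is now an assistant professor in the Department of Statistics and Data Science at Cornell University. Main research interests include high-dimensional statistics, causal inference, and semiparametric models.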

Abstract:

There are many scenarios, such as electronic health records, where the outcome is much more difficult to collect than the covariates. The data with observed outcomes are called labeled, and those without outcomes are referred to as unlabeled. In this paper, we consider the linear regression problem with such a data structure in the high-dimensional setting. Clearly, any supervised estimator can only use the labeled data. Our goal is to investigate when and how the unlabeled data can be exploited to improve the estimation of the regression parameters of the linear model, in light of the fact that such linear models may be misspecified in data analysis. To address this problem, we first establish the minimax lower bound for parameter estimation in the semi-supervised setting. We show that the upper bound from the supervised estimators that only use the labeled data cannot attain this minimax lower bound; thus there is a gap between the two bounds. When the unknown conditional mean function is correctly specified, we close this gap by proposing a new optimal semi-supervised estimator, which attains the lower bound and therefore improves on the rate of the supervised estimators. However, the proposed estimator may suffer from a slower rate than the supervised estimators if the conditional mean function is misspecified. To fix this problem, we improve our semi-supervised estimator via a two-step procedure. The resulting estimator, called the safe semi-supervised estimator, remains minimax rate-optimal when the conditional mean function is correctly specified, and is no worse than the supervised estimators even if the conditional mean function is misspecified. Furthermore, we extend our idea to aggregate multiple semi-supervised estimators arising from different misspecifications of the conditional mean function. Extensive numerical simulations and a real data analysis are conducted to illustrate our theoretical results.




