Topic: Subsampling for Rare Events Data and Maximum Sampled Conditional Likelihood
Speaker: Associate Professor HaiYing Wang, University of Connecticut, USA
Host: Professor Ling Zhou, School of Statistics
Time: Tuesday, June 20, 2023, 9:30-10:30 a.m.
Venue: Tencent Meeting, ID 483-726-988
Organizers: Center for Statistical Research and School of Statistics; Office of Scientific Research
About the speaker:
HaiYing Wang is an Associate Professor in the Department of Statistics at the University of Connecticut. His research interests include informative subdata selection for big data, model selection, model averaging, measurement error models, and semi-parametric regression. His research has been published in top statistics and machine learning journals (e.g., Biometrika, IEEE Transactions on Information Theory, JASA, and JMLR) and conferences (e.g., ICML and NeurIPS).
Abstract:
In this talk, we show that the available information about unknown parameters in rare events data is tied only to the relatively small number of cases, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To retain more information, we derive an optimal sampling probability for the inverse probability weighted (IPW) estimator. We further propose a likelihood-based estimator to improve the estimation efficiency, and show that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. The likelihood-based estimator is also generalized to a class of models beyond binary response models. We validate our approach on simulated data, the MNIST data, and a real click-through rate dataset with more than 0.3 trillion instances.