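Guanghua Forum: Distinguished Overseas Scholars Lecture Series
Topic: Crowdsourcing Utilizing Subgroup Structure of Latent Factor Modeling
Speaker: Professor Annie Qu, University of California, Irvine
Host: Professor Huazhen Lin, School of Statistics
Time: Sunday, July 2, 2023, 11:00 AM-12:00 PM
Venue: Conference Room 408, Hongyuan Building, Liulin Campus
Organizers: School of Statistics; Office of International Exchange and Cooperation; Office of Scientific Research
About the speaker: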
Annie Qu is Chancellor’s Professor in the Department of Statistics at the University of California, Irvine. She received her Ph.D. in Statistics from the Pennsylvania State University. Her research focuses on solving fundamental issues regarding structured and unstructured large-scale data, and on developing cutting-edge statistical methods and theory in machine learning and algorithms for personalized medicine, text mining, recommender systems, medical imaging data, and network data analyses for complex heterogeneous data. The newly developed methods are able to extract essential and relevant information from large-volume, high-dimensional data. Her research has had an impact in many fields, such as biomedical studies, genomic research, public health research, and the social and political sciences.
Before joining UC Irvine, Dr. Qu was Data Science Founder Professor of Statistics and Director of the Illinois Statistics Office at the University of Illinois at Urbana-Champaign. She was named Brad and Karen Smith Professorial Scholar by the College of LAS at UIUC and received the NSF CAREER Award for 2004-2009. She is a Fellow of the Institute of Mathematical Statistics, a Fellow of the American Statistical Association, and a Fellow of the American Association for the Advancement of Science. She is also a recipient of the Medallion Award and Lecturer. She is co-editor of JASA Theory and Methods for 2023-2025.
Abstract:
Crowdsourcing has emerged as an alternative solution for collecting labels at large scale. However, the majority of recruited workers are not domain experts, so the labels they contribute can be noisy. In this paper, we propose a two-stage model to predict the true labels for multicategory classification tasks in crowdsourcing. In the first stage, we fit the observed labels with a latent factor model and incorporate subgroup structures for both tasks and workers through a multi-centroid grouping penalty. Group-specific rotations are introduced to align workers with different task categories, which makes it possible to handle multicategory crowdsourcing tasks. In the second stage, we propose a concordance-based approach to identify high-quality worker subgroups, which are then relied upon to assign labels to tasks. In theory, we show the estimation consistency of the latent factors and the prediction consistency of the proposed method. Simulation studies show that the proposed method outperforms existing competitive methods when subgroup structures are present within tasks and workers. We also demonstrate the application of the proposed method to real-world problems and show its superiority.
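For readers who want a concrete feel for the two-stage idea before the talk, the following is a heavily simplified sketch and not the speaker's method. It swaps in a plain truncated SVD for the penalized latent factor model, a basic two-cluster k-means for the multi-centroid grouping penalty, and mean pairwise agreement as a stand-in for the concordance measure; group-specific rotations are omitted. The simulated data, subgroup sizes, accuracy levels, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_classes = 200, 3
n_good, n_noisy = 10, 20  # hypothetical subgroup sizes
n_workers = n_good + n_noisy

# Simulated labels: every worker starts from random guesses; the first
# n_good workers copy the true label with probability 0.9.
truth = rng.integers(n_classes, size=n_tasks)
labels = rng.integers(n_classes, size=(n_workers, n_tasks))
keep = rng.random((n_good, n_tasks)) < 0.9
labels[:n_good] = np.where(keep, truth, labels[:n_good])


def worker_factors(labels, n_classes, rank=2):
    """Stage 1 (simplified): low-rank worker factors from one-hot labels."""
    one_hot = np.eye(n_classes)[labels].reshape(labels.shape[0], -1)
    centered = one_hot - one_hot.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :rank] * s[:rank]


def two_means(x, n_iter=50):
    """Basic 2-cluster k-means, initialized at the extremes of factor 1."""
    order = np.argsort(x[:, 0])
    centers = x[[order[0], order[-1]]].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        groups = dist.argmin(axis=1)
        for g in range(2):
            if np.any(groups == g):  # keep the old center if a group empties
                centers[g] = x[groups == g].mean(axis=0)
    return groups


def concordance(labels):
    """Stage 2 (simplified): mean pairwise agreement within a worker group."""
    n = labels.shape[0]
    if n < 2:
        return 0.0
    agree = (labels[:, None, :] == labels[None, :, :]).mean(axis=2)
    return (agree.sum() - n) / (n * (n - 1))  # drop the diagonal of ones


def vote(labels, n_classes):
    """Majority vote per task among the given workers."""
    counts = np.stack([(labels == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)


factors = worker_factors(labels, n_classes)
groups = two_means(factors)
scores = [concordance(labels[groups == g]) for g in range(2)]
best = int(np.argmax(scores))
pred = vote(labels[groups == best], n_classes)
accuracy = float((pred == truth).mean())
print(f"chosen subgroup size: {(groups == best).sum()}, accuracy: {accuracy:.3f}")
```

In this toy setting the reliable workers agree with each other far more often than the random guessers do, so the higher-concordance cluster is the one whose votes are used. The method in the abstract instead estimates the subgroups jointly with the latent factors and handles category alignment through rotations, which this sketch does not attempt.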