Topic: On Data Reduction of Big Data
Organizers: Center for Statistical Research, School of Statistics, Office of Scientific Research
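Min Yang received his PhD in 2002 from the University of Illinois at Chicago, where he is now a professor. His research interests include experimental design, statistical inference, survey sampling, and longitudinal data analysis. To date he has been principal investigator on five U.S. National Science Foundation grants. He is an associate editor of Statistica Sinica and JASA, and a guest editor of the Journal of Statistical Theory and Practice. He has published 12 articles in top international journals, namely JASA and the Annals.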
The big data paradigm has drawn a significant amount of attention in recent years as the costs of acquiring and storing data have plummeted. The bottleneck has instead shifted to fast, in-depth analysis. This shift has created its own set of problems, the most obvious being that large datasets are often computationally expensive to process. Proven statistical methods are no longer applicable to extraordinarily large data sets because of computational limitations. A critical step in Big Data analysis is therefore data reduction. In this presentation, I will review some existing approaches to data reduction and introduce a new strategy called information-based optimal subdata selection (IBOSS). Under a linear model setup, for both moderate and large numbers of covariates, theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to other approaches in terms of parameter estimation and predictive performance. The results show that the IBOSS strategy adequately addresses the tradeoff between computational complexity and statistical efficiency. Some ongoing research as well as some open questions will also be discussed.
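To make the idea of subdata selection under a linear model concrete, here is a minimal Python sketch. The abstract does not state the selection rule, so the rule used here is an assumption taken from a common description of IBOSS for linear regression: for each covariate in turn, keep the rows with the r smallest and r largest values among rows not yet selected, with r = k / (2p). The function names `iboss_select` and `ols_fit`, the simulated heavy-tailed data, and all parameter values are hypothetical and serve only as illustration.

```python
import numpy as np

def iboss_select(X, k):
    """Pick about k row indices by taking extreme values of each covariate.

    Assumed rule: r = k // (2p) rows from each tail of each covariate,
    skipping rows already chosen for an earlier covariate.
    """
    n, p = X.shape
    r = max(1, k // (2 * p))
    if 2 * r * p > n:
        raise ValueError("subdata size is too large for the full data")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)          # rows still unselected
        order = idx[np.argsort(X[idx, j])]       # sorted by covariate j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False
    return np.concatenate(chosen)

def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Simulated full data (hypothetical): heavy-tailed covariates, unit-variance noise.
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 1_000
X = rng.standard_t(df=3, size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])   # intercept first
y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

sub = iboss_select(X, k)
beta_iboss = ols_fit(X[sub], y[sub])                      # fit on subdata only

uniform = rng.choice(n, size=k, replace=False)            # baseline: random subsample
beta_uniform = ols_fit(X[uniform], y[uniform])
```

Comparing `beta_iboss` and `beta_uniform` against `beta_true` gives a rough feel for the tradeoff the abstract mentions: both fits cost the same once the subdata is chosen, but the selection step determines how much information about the slopes the k rows carry.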