Robust Estimation and Inference under Huber’s Contamination Model

Posted: 2021-03-25

Guanghua Forum: Distinguished Public Figures and Entrepreneurs Forum, Session 5658


Topic: Robust Estimation and Inference under Huber’s Contamination Model

Speaker: Associate Professor Zhao Ren, University of Pittsburgh

Host: Professor Jinyuan Chang (常晋源), School of Statistics

Time: March 26, 2021, 10:00-11:00 a.m.

Livestream platform and meeting ID: Tencent Meeting, 537 472 552

Organizers: Joint Laboratory of Data Science and Business Intelligence; School of Statistics; Office of Scientific Research

About the speaker:

Zhao Ren is an Associate Professor in the Department of Statistics at the University of Pittsburgh. Prior to joining Pitt, Dr. Ren obtained his Ph.D. in Statistics at Yale University in 2014. He is broadly interested in high-dimensional statistical inference, covariance/precision matrix estimation, graphical models, robust statistics, statistical machine learning, nonparametric function estimation and applications in statistical genomics.

Abstract:

This talk describes some new challenges and results in statistical inference of regression and nonparametric estimation under the celebrated Huber’s contamination model, with a focus on the influence of contamination on the minimax rates.
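
For reference, Huber's contamination model is usually written as a mixture. The notation below is assumed here for illustration and is not taken from the talk: P is the target distribution, Q is an arbitrary contaminating distribution, and ε is the contamination proportion.

```latex
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_{\varepsilon}
  = (1 - \varepsilon)\, P + \varepsilon\, Q,
  \qquad 0 \le \varepsilon < 1 .
```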

In the first part of the talk, we study robust estimation and inference for linear models in the increasing-dimension regime. Under random design, we consider the setting in which the conditional distributions of the error terms are contaminated, with proportion ε, by an arbitrary distribution (possibly depending on the covariates) and can otherwise be heavy-tailed and asymmetric. We show that simple robust M-estimators such as Huber and smoothed Huber, with an additional intercept added to the model, achieve the minimax rates of convergence under the ℓ2 loss. In addition, a multiplier bootstrap technique provides two types of confidence intervals with root-n consistency when the necessary condition ε = o(1/√n) on the contamination proportion holds. For larger ε, we further propose a debiasing procedure to reduce the potential bias caused by contamination, and prove the validity of the debiased confidence interval. Our method extends in a straightforward way to the communication-efficient distributed estimation and inference setting.
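
The estimation step described above can be illustrated with a small simulation. This is only a minimal sketch, not the speaker's procedure: it fits the plain (not smoothed) Huber M-estimator with an intercept by iteratively reweighted least squares, and the threshold tau = 1.345, the solver, the sample size, and the contamination design (outliers whose size depends on the first covariate) are all assumptions chosen for illustration.

```python
import numpy as np

def huber_regression(X, y, tau=1.345, n_iter=200, tol=1e-10):
    """Huber M-estimator with an intercept, fit by iteratively
    reweighted least squares (IRLS). tau is an assumed threshold."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])           # add the intercept column
    theta = np.linalg.lstsq(Z, y, rcond=None)[0]   # start from OLS
    for _ in range(n_iter):
        r = y - Z @ theta
        # Huber weights: 1 inside [-tau, tau], tau/|r| outside
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        Zw = Z * w[:, None]
        theta_new = np.linalg.solve(Z.T @ Zw, Zw.T @ y)
        converged = np.linalg.norm(theta_new - theta) < tol
        theta = theta_new
        if converged:
            break
    return theta[0], theta[1:]                     # (intercept, slopes)

# Hypothetical simulation design, for illustration only.
rng = np.random.default_rng(0)
n, eps = 2000, 0.1
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((n, beta.size))
noise = rng.standard_normal(n)
outlier = rng.random(n) < eps
# Asymmetric contamination that depends on the first covariate.
noise[outlier] = 20.0 + 30.0 * (X[outlier, 0] > 0)
y = X @ beta + noise

Z = np.column_stack([np.ones(n), X])
ols_slopes = np.linalg.lstsq(Z, y, rcond=None)[0][1:]
huber_intercept, huber_slopes = huber_regression(X, y)

# l2 error of the slope estimates only
ols_err = np.linalg.norm(ols_slopes - beta)
huber_err = np.linalg.norm(huber_slopes - beta)
print(f"OLS   slope error: {ols_err:.3f}")
print(f"Huber slope error: {huber_err:.3f}")
```

In this design the outliers are all large and positive, so the bounded Huber score treats them almost identically and their effect is absorbed mainly by the fitted intercept, while the least-squares slopes are pulled toward the covariate-dependent outliers. That is one informal way to see why adding an intercept matters under asymmetric contamination.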

In the second part of the talk, we address the problem of density function estimation in R^d under L_p losses (1 ≤ p < ∞) with contaminated data. We investigate the effects of the contamination proportion ε, among other key quantities, on the corresponding minimax rates of convergence for both structured and unstructured contamination over a scale of anisotropic Nikol’skii classes: for structured contamination, ε always appears linearly in the optimal rates, while for unstructured contamination, the leading term of the optimal rate involving ε also depends on the smoothness of the target density class and the specific loss function. The corresponding adaptation theory is also investigated by establishing L_p risk oracle inequalities via novel Goldenshluger-Lepski-type methods. An interesting feature is that in certain situations adaptive estimation can become a much harder task in the presence of contamination.
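
To make the effect of contamination on density estimation concrete, the sketch below compares a naive Gaussian kernel density estimate on clean and on contaminated one-dimensional samples. It is not the estimator from the talk: the kernel, the rule-of-thumb bandwidth, the target density N(0, 1), and the contaminating density N(8, 0.5²) are all assumptions made for illustration, and the L_p distances are approximated by Riemann sums on a grid.

```python
import numpy as np

def gaussian_kde_on_grid(sample, grid, bandwidth):
    """Plain Gaussian kernel density estimate evaluated on a 1-D grid."""
    u = (grid[:, None] - sample[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * u**2).sum(axis=1)
    return kernel_sum / (sample.size * bandwidth * np.sqrt(2 * np.pi))

def lp_error(f_hat, f_true, grid, p):
    """Riemann-sum approximation of the L_p distance on a uniform grid."""
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(f_hat - f_true) ** p) * dx) ** (1.0 / p)

# Hypothetical setup, for illustration only.
rng = np.random.default_rng(0)
n, eps = 2000, 0.1
grid = np.linspace(-6.0, 14.0, 1001)
f_true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)   # target: N(0, 1)

clean = rng.standard_normal(n)
contaminated = clean.copy()
mask = rng.random(n) < eps
contaminated[mask] = rng.normal(8.0, 0.5, size=mask.sum())  # assumed Q

h = 1.06 * n ** (-1 / 5)   # rule-of-thumb bandwidth for unit-variance data

f_clean = gaussian_kde_on_grid(clean, grid, h)
f_contam = gaussian_kde_on_grid(contaminated, grid, h)

err_clean = {p: lp_error(f_clean, f_true, grid, p) for p in (1, 2)}
err_contam = {p: lp_error(f_contam, f_true, grid, p) for p in (1, 2)}
for p in (1, 2):
    print(f"L{p} error  clean: {err_clean[p]:.3f}  contaminated: {err_contam[p]:.3f}")
```

A non-robust estimator like this one targets the mixture (1 − ε)P + εQ rather than P, so its L_1 error cannot fall much below ε times the L_1 distance between the two densities, however large the sample is.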

Based on joint work with Wen-Xin Zhou and Peiliang Zhang.
