Robust Statistics - 1

    All statistical methods rely explicitly or implicitly on a number of assumptions. These
assumptions generally aim at formalizing what the statistician knows or conjectures
about the data analysis or statistical modeling problem he or she is faced with, and
at the same time aim at making the resulting model manageable from the theoretical
and computational points of view. However, it is generally understood that the
resulting formal models are simplifications of reality and that their validity is at
best approximate. The most widely used model formalization is the assumption that
the observed data have a normal (Gaussian) distribution. This assumption has been
present in statistics for two centuries, and has been the framework for all the classical
methods in regression, analysis of variance and multivariate analysis. There have
been attempts to justify the assumption of normality with theoretical arguments, such
as the central limit theorem. These attempts, however, are easily proven wrong. The
main justification for assuming a normal distribution is that it gives an approximate
representation to many real data sets, and at the same time is theoretically quite
convenient because it allows one to derive explicit formulas for optimal statistical
methods such as maximum likelihood and likelihood ratio tests, as well as the sampling
distribution of inference quantities such as t-statistics. We refer to such methods
as classical statistical methods, and note that they rely on the assumption that
normality holds exactly. The classical statistics are by modern computing standards quite
easy to compute. Unfortunately theoretical and computational convenience does not
always deliver an adequate tool for the practice of statistics and data analysis.

It often happens in practice that an assumed normal distribution model (e.g., a
location model or a linear regression model with normal errors) holds approximately
in that it describes the majority of observations, but some observations follow a
different pattern or no pattern at all. In the case when the randomness in the model is
assigned to observational errors—as in astronomy, which was the first instance of the
use of the least-squares method—the reality is that while the behavior of many sets of
data appeared rather normal, this held only approximately, with the main discrepancy
being that a small proportion of observations were quite atypical by virtue of being far
from the bulk of the data. Behavior of this type is common across the entire spectrum
of data analysis and statistical modeling applications. Such atypical data are called
outliers, and even a single outlier can have a large distorting influence on a classical
statistical method that is optimal under the assumption of normality or linearity. The
kind of “approximately” normal distribution that gives rise to outliers is one that has a
normal shape in the central region, but has tails that are heavier or “fatter” than those
of a normal distribution.
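
To make the notion concrete, the following small sketch (a purely illustrative example in
Python, not taken from the book; the function name contaminated_normal is ours) generates
such an “approximately normal” sample: most observations are drawn from a standard normal
distribution, while a small fraction come from a normal with a much larger standard
deviation, which produces a normal-looking center with heavy tails.

    # Illustrative sketch (assumes Python 3 with NumPy): most points follow N(0, 1),
    # a small fraction eps follow N(0, sigma_out^2), giving heavier-than-normal tails.
    import numpy as np

    rng = np.random.default_rng(0)

    def contaminated_normal(n, eps=0.05, sigma_out=10.0):
        """With probability 1 - eps draw from N(0, 1), otherwise from N(0, sigma_out^2)."""
        x = rng.normal(0.0, 1.0, size=n)
        is_outlier = rng.random(n) < eps
        x[is_outlier] = rng.normal(0.0, sigma_out, size=is_outlier.sum())
        return x

    sample = contaminated_normal(1000)
    print(np.std(sample))  # noticeably larger than 1 because of the heavy tails
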
One might naively expect that if such approximate normality holds, then the
results of using a normal distribution theory would also hold approximately. This
is unfortunately not the case. If the data are assumed to be normally distributed
but their actual distribution has heavy tails, then estimates based on the maximum
likelihood principle not only cease to be “best” but may have unacceptably low
statistical efficiency (unnecessarily large variance) if the tails are symmetric and may
have very large bias if the tails are asymmetric. Furthermore, for the classical tests
their level may be quite unreliable and their power quite low, and for the classical
confidence intervals their confidence level may be quite unreliable and their expected
confidence interval length may be quite large.
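
The two failure modes just described can be seen in a small simulation (again ours and
purely illustrative): under symmetric heavy-tailed contamination the sample mean, which is
the maximum likelihood estimate of location under exact normality, has a much larger
variance than the sample median, while under asymmetric contamination it acquires a
substantial bias that the median largely avoids.

    # Illustrative simulation (ours): variance and bias of the sample mean versus
    # the sample median when 5% of the observations are contaminated.
    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 100, 2000

    def simulate(contaminant_draw):
        means, medians = [], []
        for _ in range(reps):
            x = rng.normal(0.0, 1.0, size=n)
            bad = rng.random(n) < 0.05           # 5% of points are replaced
            x[bad] = contaminant_draw(bad.sum())
            means.append(x.mean())
            medians.append(np.median(x))
        return np.array(means), np.array(medians)

    # Symmetric heavy tails: contaminants from N(0, 10^2).
    m, md = simulate(lambda k: rng.normal(0.0, 10.0, size=k))
    print("variance of mean:", m.var(), " variance of median:", md.var())

    # Asymmetric contamination: contaminants near +10.
    m, md = simulate(lambda k: rng.normal(10.0, 1.0, size=k))
    print("bias of mean:", m.mean(), " bias of median:", md.mean())
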
The robust approach to statistical modeling and data analysis aims at deriving
methods that produce reliable parameter estimates and associated tests and confidence
intervals, not only when the data follow a given distribution exactly, but also when
this happens only approximately in the sense just described. While the emphasis
of this book is on approximately normal distributions, the approach works as well
for other distributions that are close to a nominal model, e.g., approximate gamma
distributions for asymmetric data. A more informal data-oriented characterization of
robust methods is that they fit the bulk of the data well: if the data contain no outliers
the robust method gives approximately the same results as the classical method, while
if a small proportion of outliers are present the robust method gives approximately the
same results as the classical method applied to the “typical” data. As a consequence
of fitting the bulk of the data well, robust methods provide a very reliable method of
detecting outliers, even in high-dimensional multivariate situations.
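
As a simple illustration of fitting the bulk of the data and then flagging whatever lies far
from that fit, the sketch below (ours; the median/MAD rule it uses is one common choice,
not a method prescribed at this point) estimates location and scale robustly and declares
as outliers the points whose robust z-score is large.

    # Illustrative sketch (ours): robust outlier flagging with the median and the MAD.
    import numpy as np

    def robust_outliers(x, cutoff=3.5):
        """Flag points whose robust z-score exceeds `cutoff` (a common rule of thumb)."""
        x = np.asarray(x, dtype=float)
        med = np.median(x)
        mad = np.median(np.abs(x - med)) / 0.6745  # rescaled MAD estimates sigma under normality
        return np.abs(x - med) / mad > cutoff

    rng = np.random.default_rng(2)
    data = np.concatenate([rng.normal(0.0, 1.0, size=50), [8.0, 9.5, -7.0]])
    print(data[robust_outliers(data)])  # should flag essentially the three planted values
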
We note that one approach to dealing with outliers is the diagnostic approach.
Diagnostics are statistics generally based on classical estimates that aim at giving
numerical or graphical clues for the detection of data departures from the assumed
model. There is a considerable literature on outlier diagnostics, and a good outlier
diagnostic is clearly better than doing nothing. However, these methods present two
drawbacks. One is that they are in general not as reliable for detecting outliers as
examining departures from a robust fit to the data. The other is that, once suspicious
observations have been flagged, the actions to be taken with them remain the analyst’s
personal decision, and thus there is no objective way to establish the properties of the
result of the overall procedure.

Robust methods have a long history that can be traced back at least to the end of
the nineteenth century with Simon Newcomb (see Stigler, 1973). But the first great
steps forward occurred in the 1960s, and the early 1970s with the fundamental work of
John Tukey (1960, 1962), Peter Huber (1964, 1967) and Frank Hampel (1971, 1974).
The applicability of the new robust methods proposed by these researchers was made
possible by the increased speed and accessibility of computers. In the last four decades
the field of robust statistics has experienced substantial growth as a research area, as
evidenced by a large number of published articles. Influential books have been written
by Huber (1981), Hampel, Ronchetti, Rousseeuw and Stahel (1986), Rousseeuw and
Leroy (1987) and Staudte and Sheather (1990). The research efforts of the current
book’s authors, many of which are reflected in the various chapters, were stimulated
by the early foundation results, as well as work by many other contributors to the
field, and the emerging computational opportunities for delivering robust methods to
users.
The above body of work has begun to have some impact outside the domain of
robustness specialists, and there appears to be a generally increased awareness of
the dangers posed by atypical data values and of the unreliability of exact model as-
sumptions. Outlier detection methods are nowadays discussed in many textbooks on
classical statistical methods, and implemented in several software packages. Further-
more, several commercial statistical software packages currently offer some robust
methods, with the robust library in S-PLUS being currently the most complete
and user friendly. In spite of the increased awareness of the impact outliers can have
on classical statistical methods and the availability of some commercial software,
robust methods remain largely unused and even unknown by most communities of
applied statisticians, data analysts, and scientists that might benefit from their use. It
is our hope that this book will help to rectify this unfortunate situation.