Parametric statistics

Parametric statistics is a branch of statistics that is concerned with the analysis of and inference from data assuming that the underlying distribution, from which the observed data was drawn, can be described by a finite set of (unknown) parameters. In contrast, nonparametric statistics does not assume explicit (finite-parametric) mathematical forms for distributions when modeling data. However, nonparametric methods may still make some assumptions about that distribution, such as continuity or symmetry, or may even assume an explicit mathematical shape while modeling a distributional parameter in a way that is not itself finite-parametric.

Most well-known statistical methods are parametric. Regarding nonparametric (and semiparametric) models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".

Overview

The main goal of statistical inference is to provide methods to systematically analyse data and infer properties of the probability distribution from which the data was drawn. The fundamental assumption underlying parametric statistics is that the data distribution belongs to a more general family of distributions that can be parameterized by a finite number of parameters <math>\theta_1,\dots,\theta_p</math>, which are unknown. Such a family of distributions is called a parametric model. A typical question in this setting is how to estimate the unknown parameters from the observed data. The most common estimation methods are the following.

Maximum likelihood estimation (MLE): The model parameters are chosen such that the probability (or probability density) of the given observation is maximal.
Method of moments (MoM): If the model parameters can be expressed as functions <math>g_1,\dots,g_p</math> of the first <math>p</math> moments of the distribution, then the moment estimates of the parameters are obtained by plugging in the sample moments: <math>g_1\left(\frac{1}{n}\sum_{i=1}^nX_i,\dots,\frac{1}{n}\sum_{i=1}^nX_i^p\right),\dots,g_p\left(\frac{1}{n}\sum_{i=1}^nX_i,\dots,\frac{1}{n}\sum_{i=1}^n X_i^p\right)</math>.
Least squares estimation (LSE): This method applies to a regression setting, where the data arises in pairs <math>(X_1,Y_1),\dots,(X_n,Y_n)</math> and a regression function <math>f</math> is to be determined. The model parameters are chosen such that the sum of squared differences <math>\sum_{i=1}^n(Y_i-f_\theta(X_i))^2</math> between the observed data and the model prediction is minimal. In fact, LSE is a special case of MLE in which the conditional distribution of <math>Y</math> given <math>X</math> is normal with mean <math>f_\theta(X)</math> and constant variance.
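The three methods above can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not a reference implementation. The helper names (normal_mle, gamma_mom, linear_lse), the chosen true parameter values, and the choice of a normal, a gamma and a straight-line model are all assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def normal_mle(xs):
    """Closed-form maximum likelihood estimates (mean, std) for a normal model."""
    n = len(xs)
    mu = sum(xs) / n
    # The MLE of the variance divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


def gamma_mom(xs):
    """Method-of-moments estimates (shape, scale) for a gamma model.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so both parameters are functions
    of the first two moments.
    """
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    var = m2 - m1 ** 2
    return m1 ** 2 / var, var / m1


def linear_lse(xs, ys):
    """Least squares estimates (intercept, slope) for f(x) = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Simulated data with known (assumed) true parameters.
normal_data = [random.gauss(5.0, 2.0) for _ in range(10_000)]
gamma_data = [random.gammavariate(3.0, 2.0) for _ in range(10_000)]
xs = [random.uniform(0.0, 10.0) for _ in range(2_000)]
ys = [1.5 + 0.8 * x + random.gauss(0.0, 0.5) for x in xs]

mu_hat, sigma_hat = normal_mle(normal_data)
shape_hat, scale_hat = gamma_mom(gamma_data)
intercept_hat, slope_hat = linear_lse(xs, ys)
```

With these sample sizes, each estimate should land close to the value used to generate the data (mean 5 and standard deviation 2, shape 3 and scale 2, intercept 1.5 and slope 0.8).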

Bayesian approaches

In a Bayesian approach, the data is not assumed to be generated by a distribution <math>L_{\theta^*}</math> for some true <math>\theta^*</math>. Instead, the set of all possible (or plausible) model parameters is initially weighted with a prior distribution <math>\pi</math> reflecting the statistician's prior belief. Given the observed data, the parameter distribution is updated via Bayes' rule, yielding the posterior distribution <math>p_\theta</math>, which is proportional to the likelihood <math>L_\theta</math> times the prior <math>\pi</math>. Bayesian estimators therefore give a best estimate given the statistician's beliefs.

Uniformly minimum variance unbiased (UMVU) estimators are estimators that have minimum variance among all unbiased estimators. Due to the bias-variance decomposition, they are optimal in the sense that they minimise the mean squared error among all unbiased estimators.
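As a concrete instance of the Bayesian update via Bayes' rule described above, the following sketch uses a conjugate pair, where the posterior stays in the same family as the prior. The Bernoulli model with a Beta prior, the function names, and the example counts are assumptions chosen for illustration.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes.

    Prior:      theta ~ Beta(alpha, beta)
    Likelihood: theta**successes * (1 - theta)**failures
    Posterior:  proportional to likelihood times prior, which is again
                a Beta density with the parameters returned here.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior on theta (hypothetical choice), then 7 successes and 3 failures.
prior = (1.0, 1.0)
posterior = beta_bernoulli_update(*prior, successes=7, failures=3)
posterior_mean = beta_mean(*posterior)
```

Here the posterior is Beta(8, 4) with mean 2/3, which lies between the prior mean 1/2 and the observed success frequency 7/10.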

Suppose a function of the model parameters <math>q(\theta)</math> needs to be estimated (e.g., mean, variance, median, or <math>\theta</math> itself). Most generally, if a UMVU estimator of <math>q(\theta)</math> with finite variance exists, then it must be unique due to the Rao-Blackwell theorem. If there exists a complete sufficient statistic <math>T</math> in the chosen statistical model, which is often easy to determine using the Fisher-Neyman factorization theorem, then the Lehmann-Scheffé theorem applies: an estimator of the form <math>f(T)</math> that is unbiased is automatically UMVU. If such an estimator has finite variance for all <math>\theta</math>, then it is also the unique UMVU estimator (UMVUE).
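A standard textbook example is the normal model with unknown mean and variance, where the sample variance with divisor <math>n-1</math> is an unbiased function of the complete sufficient statistic and hence UMVU for the variance. The simulation below is a rough sketch that only checks the unbiasedness part, comparing it with the biased maximum likelihood estimate that divides by <math>n</math>. The sample size, true variance, number of replications and function name are assumptions for the example.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible


def variance_estimates(xs):
    """Return (unbiased estimate, maximum likelihood estimate) of the variance."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    return sum_sq / (n - 1), sum_sq / n


sigma = 2.0           # true standard deviation, so the true variance is 4
n = 5                 # small sample size makes the bias clearly visible
replications = 20_000

unbiased_total = 0.0
mle_total = 0.0
for _ in range(replications):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    unbiased, mle = variance_estimates(sample)
    unbiased_total += unbiased
    mle_total += mle

mean_unbiased = unbiased_total / replications  # should be near sigma**2
mean_mle = mle_total / replications            # should be near (n - 1) / n * sigma**2
```

Averaged over many samples, the divisor <math>n-1</math> version should sit near the true variance 4, while the divisor <math>n</math> version should sit near 3.2.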

Information inequality and exponential families

In regular statistical models, it can be shown that the variance of unbiased estimators cannot be arbitrarily small: the variance of any unbiased estimator <math>T</math> of the quantity <math>q(\theta)</math> is bounded from below by the universal bound<math display="block">\mathrm{Var}_\theta[T(X)]\geq \nabla q(\theta)I(\theta)^{-1} \nabla q(\theta)^T,</math>where <math>I(\theta)</math> is the Fisher information matrix of the statistical model. The right-hand side is the so-called Cramér-Rao bound.

Estimators for which the variance equals the Cramér-Rao bound are called efficient and are also UMVU by definition. However, not every UMVUE is efficient. In fact, an estimator <math>T</math> is efficient if and only if (i) the statistical model is an exponential family and (ii) <math>T</math> is the natural sufficient statistic.
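The bound can be checked numerically in a simple case. The sketch below assumes <math>n</math> independent normal observations with unknown mean and known standard deviation <math>\sigma</math>, for which the Fisher information about the mean is <math>n/\sigma^2</math> and the Cramér-Rao bound is <math>\sigma^2/n</math>. The sample mean is efficient in this model, so its simulated variance should be close to the bound. All numerical values and variable names are assumptions for the example.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

sigma = 2.0            # known standard deviation (assumed)
n = 10                 # observations per sample
replications = 20_000  # number of simulated samples

# Fisher information for the mean is n / sigma**2, so the bound is its inverse.
cramer_rao_bound = sigma ** 2 / n

sample_means = []
for _ in range(replications):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# Empirical variance of the sample mean across replications.
grand_mean = sum(sample_means) / replications
empirical_variance = sum((m - grand_mean) ** 2 for m in sample_means) / (replications - 1)
```

With these settings the bound is 0.4, and the empirical variance of the sample mean should agree with it up to simulation noise.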

Parametric models

The choice of the model, that is, the probability distribution from which the data is assumed to be drawn in density estimation problems or an assumed functional dependence between pairs of data <math>X</math> and <math>Y</math> in regression/classification problems, lies at the core of parametric procedures. Here is a list of common models used in practice.

Density estimation

Exponential families (e.g. normal distribution, exponential distribution, log-normal distribution, Gamma distribution, Chi-squared distribution, Erlang distribution, Beta distribution, Gumbel distribution, Pareto distribution, (Negative-)Binomial distribution, Poisson distribution, geometric distribution)
Laplace distribution
Uniform distribution
Weibull distribution

Regression

linear model (special cases thereof are ANOVA and ANCOVA)
generalized linear model (GLM)
neural networks

Classification

logistic regression
linear discriminant analysis (LDA)
quadratic discriminant analysis (QDA)
neural networks

Example

The distributions in the normal family all have the same general shape and are parameterized by mean and standard deviation. That means that if the mean and standard deviation are known and if the distribution is normal, the probability of any future observation lying in a given range is known.

Suppose that we have a sample of 99 test scores with a mean of 100 and a standard deviation of 1. If we assume all 99 test scores are random observations from a normal distribution, then we predict there is a 1% chance that the 100th test score will be higher than 102.33 (that is, the mean plus 2.33 standard deviations), assuming that the 100th test score comes from the same distribution as the others. Parametric statistical methods are used to compute the 2.33 value above, given 99 independent observations from the same normal distribution.
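The 2.33 figure is the 99th percentile of the standard normal distribution. A short sketch using Python's standard library reproduces it, treating the sample mean and standard deviation as if they were the true parameters, as the example above does. The variable names are assumptions.

```python
from statistics import NormalDist

# Normal model with the mean and standard deviation from the example.
scores = NormalDist(mu=100.0, sigma=1.0)

# Score exceeded with probability 1% under the model (the 99th percentile).
threshold = scores.inv_cdf(0.99)

# The same quantile on the standard normal scale, i.e. in standard deviations.
z_value = NormalDist().inv_cdf(0.99)

# Probability under the model that a new score exceeds 102.33.
tail_probability = 1.0 - scores.cdf(102.33)
```

The threshold comes out at roughly 102.33 and the tail probability at roughly 1%, matching the figures in the text.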

A nonparametric estimate of the same thing is the maximum of the first 99 scores. We don't need to assume anything about the distribution of test scores to reason that, before we gave the test, it was equally likely that the highest score would be any of the first 100. Thus there is a 1% chance that the 100th score is higher than any of the 99 that preceded it.
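The symmetry argument can be checked by simulation. The sketch below draws 100 independent scores from a deliberately non-normal distribution and counts how often the last one exceeds all the others. The exponential distribution, the number of trials and the function name are assumptions for the example. Any continuous distribution should give a similar fraction.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def last_exceeds_rest_fraction(trials, n=100):
    """Fraction of trials in which the n-th draw is larger than the first n - 1."""
    hits = 0
    for _ in range(trials):
        # Exponential draws: a skewed distribution, far from normal.
        draws = [random.expovariate(1.0) for _ in range(n)]
        if draws[-1] > max(draws[:-1]):
            hits += 1
    return hits / trials


fraction = last_exceeds_rest_fraction(20_000)
```

The resulting fraction should be close to 0.01, in line with the 1% chance derived above without any distributional assumption beyond independent, identically distributed continuous scores.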

History

Parametric statistics was mentioned by R. A. Fisher in his work Statistical Methods for Research Workers in 1925, which created the foundation for modern statistics.