Kernel density estimation

thumb|right|250px|Kernel density estimation of 100 [[normal distribution|normally distributed random numbers using different smoothing bandwidths.]]

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.

Definition

Let <math>\mathbf{x} = \left ( x_1, x_2, x_3, ... \right )</math> be independent and identically distributed samples drawn from some univariate distribution with an unknown density at any given point . We are interested in estimating the shape of this function . Its kernel density estimator is

\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K{\left(\frac{x-x_i}{h}\right)},

</math>

where is the kernel — a non-negative function — and is a smoothing parameter called the bandwidth or simply width. though the loss of efficiency is small for the kernels listed previously. Due to its convenient mathematical properties, the normal kernel is often used, which means , where is the standard normal density function. The kernel density estimator then becomes

\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h\sqrt{2\pi \exp\left(\frac{-(x-x_i)^2}{2h^2}\right),

</math>

where <math>h</math> is the standard deviation of the sample <math>\mathbf{x}</math>.

The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in thermodynamics, this is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point locations . Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning (e.g. diffusion map).

Example

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. The diagram below based on these 6 data points illustrates this relationship:

{| class="wikitable"

! Sample

| 1 || 2 || 3 || 4 || 5 || 6

! Value

| −2.1 || −1.3 || −0.4 || 1.9 || 5.1 || 6.2

For the histogram, first, the horizontal axis is divided into sub-intervals or bins which cover the range of the data: In this case, six bins each of width 2. Whenever a data point falls inside this interval, a box of height 1/12 is placed there. If more than one data point falls inside the same bin, the boxes are stacked on top of each other.

For the kernel density estimate, normal kernels with a standard deviation of 1.5 (indicated by the red dashed lines) are placed on each of the data points x<sub>i</sub>. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate (compared to the discreteness of the histogram) illustrates how kernel density estimates converge faster to the true underlying density for continuous random variables.

thumb|center|500px|alt=Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The six individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.|Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The six individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the [[rug plot on the horizontal axis.]]

Bandwidth selection

thumb|Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.

The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution (plotted at the blue spikes in the rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth h = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth obscures much of the underlying structure. The black curve with a bandwidth of h = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density. An extreme situation is encountered in the limit <math>h \to 0</math> (no smoothing), where the estimate is a sum of delta functions centered at the coordinates of analyzed samples. In the other extreme limit <math>h \to \infty</math> the estimate retains the shape of the used kernel, centered on the mean of the samples (completely smooth).

The most common optimality criterion used to select this parameter is the expected L<sub>2</sub> risk function, also termed the mean integrated squared error:

<math display="block">\operatorname{MISE} (h) = \operatorname{E}\!\left[ \int\! {\left(\hat{f}\!_h(x) - f(x)\right)}^2 dx \right]</math>

Under weak assumptions on and , ( is the, generally unknown, real density function), with the general consensus that the plug-in selectors and cross validation selectors are the most useful over a wide range of data sets.

Substituting any bandwidth which has the same asymptotic order as into the AMISE gives that , where is the big O notation. It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator. Note that the rate is slower than the typical convergence rate of parametric methods.

If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method termed adaptive or variable bandwidth kernel density estimation.

Bandwidth selection for kernel density estimation of heavy-tailed distributions is relatively difficult.

A rule-of-thumb bandwidth estimator

If Gaussian basis functions are used to approximate univariate data, and the underlying density being estimated is Gaussian, the optimal choice for h (that is, the bandwidth that minimises the mean integrated squared error) is:

<math display="block">h = {\left(\frac{4\hat{\sigma}^5}{3n}\right)}^{1/5} \approx 1.06 \, \hat{\sigma} \, n^{-1/5},</math>

An <math>h</math> value is considered more robust when it improves the fit for long-tailed and skewed distributions or for bimodal mixture distributions. This is often done empirically by replacing the standard deviation <math>\hat{\sigma}</math> by the parameter <math>A</math> below:

<math display="block">A = \min\left(\hat{\sigma}, \frac{\mathrm{IQR{1.34}\right)</math> where IQR is the interquartile range.

thumb|upright=1.25|alt=Comparison between rule of thumb and solve-the-equation bandwidth|Comparison between rule of thumb and solve-the-equation bandwidth.

Another modification that will improve the model is to reduce the factor from 1.06 to 0.9. Then the final formula would be:

<math display="block">h = 0.9\, \min\left(\hat{\sigma}, \frac{\mathrm{IQR{1.34}\right)\, n^{-1/5}</math>

where <math>n</math> is the sample size.

This approximation is termed the normal distribution approximation, Gaussian approximation, or Silverman's rule of thumb. where <math>g(x)</math> and <math>\lambda_1(x)</math> are KDE version of <math>g(x)</math> and <math>\lambda_1(x)</math>. Under mild assumptions, <math>M_c</math> is a consistent estimator of <math>M</math>. Note that one can use the mean shift algorithm to compute the estimator <math>M_c</math> numerically.

Statistical implementation

A non-exhaustive list of software implementations of kernel density estimators includes:

In Analytica release 4.4, the Smoothing option for PDF results uses KDE, and from expressions it is available via the built-in <code>Pdf</code> function.
In C/C++, FIGTree is a library that can be used to compute kernel density estimates using normal kernels. MATLAB interface available.
In C++, libagf is a library for variable kernel density estimation.
In C++, mlpack is a library that can compute KDE using many different kernels. It allows to set an error tolerance for faster computation. Python and R interfaces are available.
in C# and F#, Math.NET Numerics is an open source library for numerical computation which includes kernel density estimation
In CrimeStat, kernel density estimation is implemented using five different kernel functions – normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional Journey-to-crime density function, and in estimating a three-dimensional Bayesian Journey-to-crime estimate.
In ELKI, kernel density functions can be found in the package <code>de.lmu.ifi.dbs.elki.math.statistics.kernelfunctions</code>
In ESRI products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the Quartic(biweight) kernel.
In Excel, the Royal Society of Chemistry has created an add-in to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
In gnuplot, kernel density estimation is implemented by the <code>smooth kdensity</code> option, the datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically according to "Silverman's rule of thumb" (see above).
In Haskell, kernel density is implemented in the statistics package.
In IGOR Pro, kernel density estimation is implemented by the <code>StatsKDE</code> operation (added in Igor Pro 7.00). Bandwidth can be user specified or estimated by means of Silverman, Scott or Bowmann and Azzalini. Kernel types are: Epanechnikov, Bi-weight, Tri-weight, Triangular, Gaussian and Rectangular.
In Java, the Weka machine learning package provides weka.estimators.KernelEstimator, among others.
In JavaScript, the visualization package D3.js offers a KDE package in its science.stats package.
In JMP, the Graph Builder platform utilizes kernel density estimation to provide contour plots and high density regions (HDRs) for bivariate densities, and violin plots and HDRs for univariate densities. Sliders allow the user to vary the bandwidth. Bivariate and univariate kernel density estimates are also provided by the Fit Y by X and Distribution platforms, respectively.
In Julia, kernel density estimation is implemented in the KernelDensity.jl package.
In KNIME, 1D and 2D Kernel Density distributions can be generated and plotted using nodes from the Vernalis community contribution, e.g. 1D Kernel Density Plot, among others. The underlying implementation is written in Java.
In MATLAB, kernel density estimation is implemented through the <code>ksdensity</code> function (Statistics Toolbox). As of the 2018a release of MATLAB, both the bandwidth and kernel smoother can be specified, including other options such as specifying the range of the kernel density. Alternatively, a free MATLAB software package which implements an automatic bandwidth selection method).
In Mathematica, numeric kernel density estimation is implemented by the function <code>SmoothKernelDistribution</code> and symbolic estimation is implemented using the function <code>KernelMixtureDistribution</code> both of which provide data-driven bandwidths.
In Minitab, the Royal Society of Chemistry has created a macro to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
In the NAG Library, kernel density estimation is implemented via the <code>g10ba</code> routine (available in both the Fortran and the C versions of the Library).
In Nuklei, C++ kernel density methods focus on data from the Special Euclidean group <math>SE(3)</math>.
In Octave, kernel density estimation is implemented by the <code>kernel_density</code> option (econometrics package).
In Origin, 2D kernel density plot can be made from its user interface, and two functions, Ksdensity for 1D and Ks2density for 2D can be used from its LabTalk , Python, or C code.
In Perl, an implementation can be found in the Statistics-KernelEstimation module
In PHP, an implementation can be found in the MathPHP library
In Python, many implementations exist: pyqt_fit.kde Module in the PyQt-Fit package, SciPy (<code>scipy.stats.gaussian_kde</code>), Statsmodels (<code>KDEUnivariate</code> and <code>KDEMultivariate</code>), and scikit-learn (<code>KernelDensity</code>) (see comparison). KDEpy supports weighted data and its FFT implementation is orders of magnitude faster than the other implementations. The commonly used pandas library [https://pandas.pydata.org/] offers support for kde plotting through the plot method (<code>df.plot(kind='kde')</code>[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html]). The getdist package for weighted and correlated MCMC samples supports optimized bandwidth, boundary correction and higher-order methods for 1D and 2D distributions. One newly used package for kernel density estimation is seaborn ( <code>import seaborn as sns</code> , <code>sns.kdeplot()</code> ). A GPU implementation of KDE also exists.
In R, it is implemented through <code>density</code> in the base distribution, and <code>bw.nrd0</code> function is used in stats package, this function uses the optimized formula in Silverman's book. <code>bkde</code> in the KernSmooth library, <code>ParetoDensityEstimation</code> in the DataVisualizations library (for pareto distribution density estimation), <code>kde</code> in the ks library, <code>dkden</code> and <code>dbckden</code> in the evmix library (latter for boundary corrected kernel density estimation for bounded support), <code>npudens</code> in the np library (numeric and categorical data), <code>sm.density</code> in the sm library. For an implementation of the <code>kde.R</code> function, which does not require installing any packages or libraries, see kde.R. The btb library, dedicated to urban analysis, implements kernel density estimation through <code>kernel_smoothing</code>.
In SAS, <code>proc kde</code> can be used to estimate univariate and bivariate kernel densities.
In Apache Spark, the <code>KernelDensity()</code> class
In Stata, it is implemented through <code>kdensity</code>; for example <code>histogram x, kdensity</code>. Alternatively a free Stata module KDENS is available allowing a user to estimate 1D or 2D density functions.
In Swift, it is implemented through <code>SwiftStats.KernelDensityEstimation</code> in the open-source statistics library SwiftStats.

References

External links

Introduction to kernel density estimation A short tutorial which motivates kernel density estimators as an improvement over histograms.
Kernel Bandwidth Optimization A free online tool that generates an optimized kernel density estimate.
Free Online Software (Calculator) computes the Kernel Density Estimation for a data series according to the following Kernels: Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine.
Kernel Density Estimation Applet An online interactive example of kernel density estimation. Requires .NET 3.0 or later.