G-test - WikiHQ

In statistics, G-tests are likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.

Formulation

The general formula for the G-test statistic is

:<math>

G = 2\sum_{i} {O_{i} \cdot \ln\left(\frac{O_i}{E_i}\right)},

</math>

where <math>O_i \geq 0</math> is the observed count in a cell, <math>E_i > 0</math> is the expected count under the null hypothesis, <math>\ln</math> denotes the natural logarithm, and the sum is taken over all non-empty cells. The resulting <math>G</math> is asymptotically chi-squared distributed as the total number of observations tends to infinity (convergence in distribution).

Furthermore, the total observed count must be equal to the total expected count:

:<math>

\sum_i O_i = \sum_i E_i = N,

</math>

where <math>N</math> is the total number of observations.
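The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>g_statistic</code> and the example counts are assumptions made up for this sketch.

```python
import math

def g_statistic(observed, expected):
    """Return G = 2 * sum(O_i * ln(O_i / E_i)) over the non-empty cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    # The formula assumes that both sets of counts add up to the same N.
    if not math.isclose(sum(observed), sum(expected)):
        raise ValueError("total observed and expected counts must agree")
    # Empty cells (O_i = 0) are skipped, as in the definition of the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts with N = 100 (illustrative only).
observed = [30, 50, 20]
expected = [25.0, 50.0, 25.0]
print(g_statistic(observed, expected))
```

The resulting value would then be compared against a chi-squared distribution with the appropriate number of degrees of freedom.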

Both the G-test statistic <math>G</math> and the chi-squared test statistic <math>\chi^2</math> are special cases of a general family of power divergence statistics due to Cressie and Read.

McDonald recommends always using an exact test (exact test of goodness-of-fit, Fisher's exact test) if the total sample size is less than 1 000.

:There is nothing magical about a sample size of 1 000, it's just a nice round number that is well within the range where an exact test, chi-square test, and G–test will give almost identical P values. Spreadsheets, web-page calculators, and SAS shouldn't have any problem doing an exact test on a sample size of 1 000.

:::: — John H. McDonald (2014)
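The power divergence family of Cressie and Read mentioned above can be illustrated with a short sketch. It assumes the usual parametrisation <math>\tfrac{2}{\lambda(\lambda+1)}\sum_i O_i\left[(O_i/E_i)^\lambda - 1\right]</math>, strictly positive observed counts, and equal totals; the function name and the counts are hypothetical. With <math>\lambda = 1</math> it reproduces Pearson's statistic, and the limit <math>\lambda \to 0</math> gives <math>G</math>.

```python
import math

def power_divergence(observed, expected, lam):
    """Cressie-Read statistic for parameter lam (assumes all O_i > 0)."""
    if lam == 0:
        # The limit lam -> 0 of the general formula is the G statistic.
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(observed, expected))
    scale = 2.0 / (lam * (lam + 1))
    return scale * sum(o * ((o / e) ** lam - 1)
                       for o, e in zip(observed, expected))

# Hypothetical counts with equal totals (illustrative only).
observed = [30, 50, 20]
expected = [25.0, 50.0, 25.0]

pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(power_divergence(observed, expected, 1), pearson)   # lam = 1: Pearson
print(power_divergence(observed, expected, 0))            # lam = 0: G
```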

Relation to other metrics

Relation to the chi-squared test

The commonly used chi-squared tests for goodness of fit to a distribution and for independence in contingency tables are in fact approximations of the log-likelihood ratio on which the G-tests are based.

The general formula for Pearson's chi-squared test statistic is

:<math>

\chi^2 = \sum_{i} {\frac{\left(O_i - E_i\right)^2}{E_i}}.

</math>

The approximation of the G-test statistic by the chi-squared test statistic is obtained by a second-order Taylor expansion of the natural logarithm around 1 (see the derivation below).

We have <math>G \approx \chi^2</math> when the observed counts <math>O_i</math> are close to the expected counts <math>E_i</math>. When the difference is large, however, the approximation by the chi-squared test statistic begins to break down. In that case the effects of outliers in the data are more pronounced, which explains why chi-squared tests fail in situations with little data.

For samples of a reasonable size, the G-test and the chi-squared test will lead to the same conclusions. However, the approximation to the theoretical chi-squared distribution is better for the G-test than for Pearson's chi-squared test. In cases where <math>O_i > 2\cdot E_i</math> for some cell, the G-test is always better than the chi-squared test.
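A small numerical comparison may make the relation between the two statistics more concrete. The sketch below uses made-up counts: one table where the observed counts are close to the expected ones, and one where a cell has <math>O_i > 2\cdot E_i</math>. It only shows how far apart the two statistics are; it does not by itself show which test is more accurate.

```python
import math

def g_statistic(observed, expected):
    # G = 2 * sum(O_i * ln(O_i / E_i)) over non-empty cells
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def pearson_chi2(observed, expected):
    # chi^2 = sum((O_i - E_i)^2 / E_i)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical expected counts and two hypothetical observations (N = 200).
expected = [50.0, 150.0]
close = [52, 148]    # observed counts near the expected counts
far = [120, 80]      # first cell has O_1 > 2 * E_1

for label, obs in (("close", close), ("far", far)):
    print(label, g_statistic(obs, expected), pearson_chi2(obs, expected))
```

For the first table the two values should agree to within about a percent, while for the second they differ substantially.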

For testing goodness-of-fit the G-test is infinitely more efficient than the chi-squared test in the sense of Bahadur, but the two tests are equally efficient in the sense of Pitman or in the sense of Hodges and Lehmann.

Derivation (chi-squared)

Consider

:<math>

G = 2\sum_i {O_i \ln\left(\frac{O_i}{E_i}\right)},

</math>

and let <math>O_i = E_i + \delta_i</math> with <math>\textstyle\sum_i \delta_i = 0</math>, so that the total number of counts remains the same. Assume that <math>\delta_i=O_i-E_i</math> is small in comparison to <math>E_i</math> for all <math>i</math>. To be more precise, notice that <math>E_i = \Theta(n)</math> using big Θ notation. If <math>O_i = E_i + \mathcal{O}(n^{1/2})</math> using big O notation for large <math>n</math>, which should be true under the null hypothesis because of the central limit theorem, then <math>\delta_i = \mathcal{O}(n^{1/2})</math> and

:<math>

\frac{\delta_i^3}{E_i^2} = \mathcal{O}\left(\frac{n^{3/2}}{n^2}\right) = \mathcal{O}(n^{-1/2})

</math>

follow, which will be used later.

Upon substitution we find,

:<math>

G = 2\sum_i (E_i + \delta_i) \ln \left(1+\frac{\delta_i}{E_i}\right).

</math>

Using the Taylor expansion <math>\ln(1 + x) = x - \tfrac{1}{2}x^2 + \mathcal{O}(x^3)</math> yields

:<math>

G = 2\sum_i (E_i + \delta_i) \left(\frac{\delta_i}{E_i} - \frac{1}{2}\frac{\delta_i^2}{E_i^2} + \mathcal{O}\left(\frac{\delta_i^3}{E_i^3}\right) \right),

</math>

and distributing terms we find,

:<math>

G = 2\sum_i \left( \delta_i + \frac{1}{2}\frac{\delta_i^2}{E_i} + \mathcal{O}\left(\frac{\delta_i^3}{E_i^2}\right) \right).

</math>

Now, using <math>\textstyle\sum_i \delta_i = 0</math> and <math>\delta_i = O_i - E_i</math> and <math>\mathcal{O}(\delta_i^3/E_i^2)=\mathcal{O}(n^{-1/2})</math> for large <math>n</math>, we can write the result,

:<math>

G \approx \sum_{i} \frac{\left(O_i-E_i\right)^2}{E_i}.

</math>
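The order of the error term in this derivation can be checked numerically. The following sketch assumes hypothetical null probabilities and a fixed deviation pattern that sums to zero, sets <math>\delta_i</math> proportional to <math>n^{1/2}</math>, and prints <math>G - \chi^2</math> for growing <math>n</math>. Non-integer "counts" are used purely to probe the formulas.

```python
import math

def g_statistic(observed, expected):
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def pearson_chi2(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p = [0.2, 0.3, 0.5]     # hypothetical null probabilities (assumption)
c = [1.0, -2.0, 1.0]    # deviation pattern; sums to zero so totals match

def gap(n):
    """Return G - chi^2 when E_i = n * p_i and delta_i = c_i * sqrt(n)."""
    expected = [n * pi for pi in p]
    observed = [e + ci * math.sqrt(n) for e, ci in zip(expected, c)]
    return g_statistic(observed, expected) - pearson_chi2(observed, expected)

gaps = [gap(n) for n in (10**2, 10**4, 10**6)]
print(gaps)
```

If the derivation holds, each hundredfold increase in <math>n</math> should shrink the gap by roughly a factor of ten once <math>n</math> is large, consistent with an <math>\mathcal{O}(n^{-1/2})</math> remainder.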

Relation to Kullback–Leibler divergence

The G-test statistic is proportional to the Kullback–Leibler divergence of the theoretical distribution <math>\tilde p=(\tilde p_1,\ldots,\tilde p_m)</math> of the null hypothesis from the empirical distribution <math>\hat p=(\hat p_1,\ldots,\hat p_m)</math> of the observed data:

:<math>

\begin{align}
G &= 2\sum_i {O_i \cdot \ln\left(\frac{O_i}{E_i}\right)}
= 2 N \sum_i {\hat p_i \cdot \ln\left(\frac{\hat p_i}{\tilde p_i}\right)} \\
&= 2 N \, D_{\mathrm{KL}}(\hat p\|\tilde p),
\end{align}</math>

where <math>N</math> is the total number of observations and <math>\tilde p_i = \tfrac{E_i}{N}</math> and <math>\hat p_i = \tfrac{O_i}{N}</math> are the theoretical and empirical probabilities of objects of type <math>i</math>, respectively.
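The identity <math>G = 2N \, D_{\mathrm{KL}}(\hat p\|\tilde p)</math> can be verified on a small example. The counts below are made up, and the helper name <code>kl_divergence</code> is an assumption of this sketch.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical observed and expected counts with equal totals.
observed = [30, 50, 20]
expected = [25.0, 50.0, 25.0]
n = sum(observed)

p_hat = [o / n for o in observed]      # empirical probabilities
p_tilde = [e / n for e in expected]    # theoretical probabilities

g_direct = 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
g_from_kl = 2.0 * n * kl_divergence(p_hat, p_tilde)
print(g_direct, g_from_kl)
```

Both printed values should agree up to floating-point rounding.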

Relation to mutual information

For the analysis of contingency tables, the value of the G-test statistic can also be expressed in terms of mutual information.

In this case objects with two-dimensional types <math>(i,j)</math> are considered. Let <math>O_{ij}</math> be the count of objects of type <math>(i,j)</math>, i.e., <math>O_{ij}</math> is the entry in the contingency table in row <math>i</math> and column <math>j</math>. Set

:<math>

N = \sum_{ij} O_{ij}, \qquad
\hat p_{ij} = \frac{O_{ij}}{N} \,, \qquad
\hat p_{i \bullet} = \frac{\sum_j O_{ij}}{N} \,, \qquad
\hat p_{\bullet j} = \frac{\sum_i O_{ij}}{N} \,.

</math>

Then the estimated expected count of objects of type <math>(i,j)</math> assuming independence is given by

:<math>

E_{ij} = N \hat p_{i \bullet} \hat p_{\bullet j}.

</math>

Finally, the G-test statistic in this case is given by

:<math>

G = 2 \sum_{ij} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right).

</math>

Let <math>X,Y</math> be random variables with joint distribution given by the empirical distribution <math>\hat p_{ij}</math> of the contingency table, i.e.,

:<math>

P(X=i, Y=j) = \hat p_{ij}, \qquad

P(X=i) = \hat p_{i \bullet}, \qquad

P(Y=j) = \hat p_{\bullet j}.

</math>

Then the G-test statistic can be expressed in several alternative forms:

:<math>\begin{align}
G &= 2N \cdot \sum_{ij}{\hat p_{ij} \left( \ln(\hat p_{ij})-\ln(\hat p_{i \bullet})-\ln(\hat p_{\bullet j}) \right)} \\
&= 2N \cdot \Bigl( H(X) + H(Y) - H(X,Y) \Bigr) \\
&= 2N \cdot \operatorname{MI}(X,Y),
\end{align}</math>

where the entropies <math>H(X)</math> and <math>H(Y)</math> are given by

:<math>

H(X) = -\sum_i \hat p_{i \bullet} \ln(\hat p_{i \bullet}), \qquad

H(Y) = -\sum_j \hat p_{\bullet j} \ln(\hat p_{\bullet j})

</math>

and the joint entropy <math>H(X,Y)</math> is given by

:<math>

H(X,Y) = -\sum_{ij} \hat p_{ij} \ln(\hat p_{ij})

</math>

and the mutual information of <math>X</math> and <math>Y</math> is

:<math>

\operatorname{MI}(X,Y) = H(X) + H(Y) - H(X,Y).

</math>
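The equivalence between the contingency-table form of <math>G</math> and <math>2N \cdot \operatorname{MI}(X,Y)</math> can be checked with a short sketch. The table and all function names below are hypothetical and chosen only for illustration.

```python
import math

def g_independence(table):
    """G for a contingency table, with E_ij = N * p_i. * p_.j."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:  # empty cells contribute nothing
                e = row_tot[i] * col_tot[j] / n
                total += o * math.log(o / e)
    return 2.0 * total

def entropy(probs):
    """Shannon entropy in nats, ignoring zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def g_from_mutual_information(table):
    """G computed as 2N * (H(X) + H(Y) - H(X,Y))."""
    n = sum(sum(row) for row in table)
    p_row = [sum(row) / n for row in table]
    p_col = [sum(col) / n for col in zip(*table)]
    p_joint = [o / n for row in table for o in row]
    mi = entropy(p_row) + entropy(p_col) - entropy(p_joint)
    return 2.0 * n * mi

# Hypothetical 2 x 3 contingency table (illustrative only).
table = [[12, 5, 7],
         [3, 9, 14]]
print(g_independence(table), g_from_mutual_information(table))
```

Both routes should give the same value up to rounding, and a table whose rows are exactly proportional gives <math>G = 0</math>.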

It can also be shown that the inverse document frequency weighting commonly used for text retrieval is an approximation of G applicable when the row sum for the query is much smaller than the row sum for the remainder of the corpus. Similarly, the result of Bayesian inference applied to a choice of single multinomial distribution for all rows of the contingency table taken together versus the more general alternative of a separate multinomial per row produces results very similar to the G-test statistic.

Application

The McDonald–Kreitman test in statistical genetics is an application of the G-test.
Dunning introduced the test to the computational linguistics community where it is now widely used.
The R-scape program (used by Rfam) uses the G-test to detect co-variation between RNA sequence alignment positions.

Statistical software

In R, fast implementations can be found in the AMR and Rfast packages. For the AMR package, the command is <code>g.test</code>, which works exactly like <code>chisq.test</code> from base R. R also has the <code>likelihood.test</code> function in the Deducer package. Note: Fisher's G-test in the GeneCycle package of the R programming language (<code>fisher.g.test</code>) does not implement the G-test as described in this article, but rather Fisher's exact test of Gaussian white noise in a time series.
Another R implementation that computes the G-test statistic and the corresponding p-values is provided by the R package entropy. The commands are <code>Gstat</code> for the standard G statistic and the associated p-value and <code>Gstatindep</code> for the G statistic applied to comparing joint and product distributions to test independence.
In SAS, one can conduct a G-test by applying the <code>/chisq</code> option after <code>proc freq</code>.
In Stata, one can conduct a G-test by applying the <code>lr</code> option after the <code>tabulate</code> command.
In Java, use <code>org.apache.commons.math3.stat.inference.GTest</code>.
In Python, use <code>scipy.stats.power_divergence</code> with <code>lambda_=0</code>.
