Statistical hypothesis test

{| class="wikitable" style="float:right; clear:right;"

! Test Statistic !! Type of Test

| t-statistic || t-test Regression test

| F-statistic || ANOVA MANOVA ANCOVA

| z-statistic || z-test

| x2-statistic || Chi-square test

| colspan=2 | Some of the most common test statistics and their corresponding statistical tests or models.

A statistical hypothesis test is a method of statistical inference used to decide whether the data provide sufficient evidence to reject a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests are in use.

Definition of terms

The goal of a hypothesis test is to establish whether certain properties of a statistical population are true by examining sample data. Typically, the population is modelled by a random variable whose distribution has unknown parameters. For example, a medical trial may wish to establish whether a particular drug is effective in treating high blood pressure, with "the change in blood pressure observed in a patient who takes the drug" being the random variable. An example hypothesis could be "the mean change in blood pressure is zero" or "the mean change in blood pressure is negative". In general, any statement about the parameters describing a population can be a hypothesis (but not a statement about the sample).

The test compares two hypotheses: a default "null" hypothesis (denoted H0) and its negation, the "alternative" hypothesis (H1). Typically the test will select a null hypothesis that the intervention being studied has no effect, or that the population parameter takes some "obvious" value. A test statistic is computed from the given sample data, and the tester calculates the conditional probability of observing a value at least this extreme, supposing the null hypothesis is true. If this probability (called the p-value) is less than the significance level of the test (denoted <math>\alpha</math>), then the null hypothesis is rejected. The test does not conclude that the null hypothesis is false, or that the probability that the null hypothesis is false is less than <math>\alpha</math>.

Because it is usually impossible to definitely establish whether the hypothesis being tested is true or false from a sample, the conclusion of a hypothesis test is not certain to be correct. There are two possible classes of error:

A type I error, in which the null hypothesis is rejected despite the null hypothesis being true, with probability <math>\alpha = P(\text{reject } H_0|H_0)</math>. This is the same as the significance level of the test.
A type II error, in which the null hypothesis is accepted despite the alternative hypothesis being true, with probability <math>\beta = P(\text{accept }H_0|H_1)</math>. The quantity <math>1-\beta</math> is called the power of the test.

Some further definitions:

: Any hypothesis which specifies the population distribution completely.
Composite hypothesis: Any hypothesis which does not specify the population distribution completely.
Positive data: Data that enable the investigator to reject a null hypothesis.

thumb|260x260px|Suppose the data can be realized from an N(0,1) distribution. For example, with a chosen significance level α = 0.05, from the Z-table, a one-tailed critical value of approximately 1.645 can be obtained. The one-tailed critical value Cα ≈ 1.645 corresponds to the chosen significance level. The critical region [Cα, ∞) is realized as the tail of the standard normal distribution.

s of a statistical test are the boundaries of the acceptance region of the test. The acceptance region is the set of values of the test statistic for which the null hypothesis is not rejected. Depending on the shape of the acceptance region, there can be one or more than one critical value.
/ : The set of values of the test statistic for which the null hypothesis is rejected.
Size: For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis. The false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.") See sensitivity and specificity and type I and type II errors for exhaustive definitions.
: A predecessor to the statistical hypothesis test (see the Origins section). An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used for the modern version which is now part of statistical hypothesis testing.
Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Exact test

A statistical hypothesis test compares a test statistic (z or t for examples) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:

Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
Uniformly most powerful test (UMP)

History

While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth; see .

1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus, the null hypothesis in this case that the birthrates of boys and girls should be equal given "conventional wisdom".

1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".

Modern origins and early controversy

Modern significance testing is largely the product of Karl Pearson (p-value, Pearson's chi-squared test), William Sealy Gosset (Student's t-distribution), and Ronald Fisher ("null hypothesis", analysis of variance, "significance test"), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the principle of indifference when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference.

Fisher emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century.

Fisher popularized the "significance test". He required a null-hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null-hypothesis or not. Significance testing did not utilize an alternative hypothesis so there was no concept of a Type II error (false negative).

The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis. Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher.

Neyman & Pearson considered a different problem to Fisher (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities.

Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing (the defining paper

The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference.

Neyman accepted a position in the University of California, Berkeley in 1938, breaking his partnership with Pearson and separating the disputants (who had previously occupied the same building). The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels.

Null hypothesis significance testing (NHST)

The modern version of hypothesis testing is generally called the null hypothesis significance testing (NHST) (but signal detection, for example, still uses the Neyman/Pearson formulation). Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs.

Sometime around 1940,

Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.

Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments.

Hypothesis testing is of continuing interest to philosophers.

Education

Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly. An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.

Raymond S. Nickerson commented: