Reliability (statistics)

In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:<blockquote>It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 (much error) and 1.00 (no error), are usually used to indicate the amount of error in the scores.</blockquote> For example, measurements of people's height and weight are often extremely reliable.

Types

There are several general classes of reliability estimates:

Inter-rater reliability assesses the degree of agreement between two or more raters in their appraisals. For example, a person gets a stomach ache and different doctors all give the same diagnosis.
Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next. Measurements are gathered from a single rater who uses the same methods or instruments and the same testing conditions.
Internal consistency reliability assesses the consistency of results across items within a test.
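As an illustration of inter-rater agreement, the sketch below compares two raters using simple percent agreement and Cohen's kappa. Kappa is one common chance-corrected index and is not named in the text above, so its use here is an assumption; the diagnoses are hypothetical data invented for the example.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of cases on which the two raters give the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical diagnoses from two doctors for the same ten patients.
doctor_1 = ["ulcer", "ulcer", "reflux", "reflux", "ulcer",
            "reflux", "ulcer", "reflux", "reflux", "ulcer"]
doctor_2 = ["ulcer", "ulcer", "reflux", "ulcer", "ulcer",
            "reflux", "ulcer", "reflux", "reflux", "reflux"]

agreement = percent_agreement(doctor_1, doctor_2)  # 8 of 10 cases match
kappa = cohens_kappa(doctor_1, doctor_2)           # lower, after removing chance agreement
```

With these invented data the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate.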

Reliability does not imply validity: a reliable measure is measuring something consistently, but not necessarily what it is supposed to measure. For example, if a set of weighing scales consistently measured the weight of an object as 500 grams over the true weight, then the scale would be very reliable, but it would not be valid (as the returned weight is not the true weight). For the scale to be valid, it should return the true weight of an object. This example demonstrates that a perfectly reliable measure is not necessarily valid, but that a valid measure necessarily must be reliable.
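The weighing-scale example can be made concrete with a few numbers. This is a minimal sketch with hypothetical readings and a hypothetical true weight: the spread of repeated readings stands in for (un)reliability, and the average offset from the true weight stands in for (in)validity.

```python
from statistics import mean, pstdev

TRUE_WEIGHT_G = 1000  # hypothetical true weight of the object, in grams

# Hypothetical repeated readings from a scale that reads about 500 g too high.
readings_g = [1500, 1501, 1499, 1500, 1500]

spread_g = pstdev(readings_g)              # small spread: the readings are consistent
bias_g = mean(readings_g) - TRUE_WEIGHT_G  # large offset: the readings are not accurate
```

The readings barely vary from one weighing to the next, yet every one of them is about 500 g away from the true weight.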

General model

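In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors: those that contribute to consistency (stable characteristics of the individual or of the attribute being measured) and those that contribute to inconsistency (features of the individual or the situation that affect test scores but have nothing to do with the attribute being measured). In classical test theory, an observed score is treated as the sum of a true score and an error of measurement, and it is assumed that: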

Mean error of measurement = 0
True scores and errors are uncorrelated
Errors on different measures are uncorrelated

Reliability theory shows that the variance of obtained scores is simply the sum of the variance of true scores plus the variance of errors of measurement. The reliability coefficient is then the proportion of obtained-score variance that is due to true scores.

A single coefficient of this kind assumes that measurement precision is the same at every score level. In practice, tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the inverse of the squared conditional standard error of measurement at any given test score.
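The variance decomposition can be checked with a small simulation. The sketch below assumes normally distributed true scores and errors with hypothetical standard deviations of 10 and 5; none of these numbers come from the text. Because the simulated errors are independent of the true scores and have mean zero, the observed variance should be close to the sum of the two components.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the sketch is reproducible

N = 10_000
true_scores = [random.gauss(100, 10) for _ in range(N)]  # true-score variance near 100
errors = [random.gauss(0, 5) for _ in range(N)]          # error variance near 25, mean 0
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = pvariance(true_scores)
var_error = pvariance(errors)
var_observed = pvariance(observed)

# Reliability as the share of observed-score variance due to true scores.
reliability = var_true / var_observed
```

In a finite sample the sum is only approximate, because the sample covariance between true scores and errors is small but not exactly zero.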

Estimation

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.

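Four practical strategies have been developed that provide workable methods of estimating test reliability: the test-retest method, the parallel-forms method, the split-half method, and internal-consistency methods. Among the internal-consistency methods, Cronbach's alpha is a generalization of an earlier estimate, the Kuder–Richardson Formula 20.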
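The relationship between Cronbach's alpha and Kuder–Richardson Formula 20 can be sketched as follows. The functions use the standard textbook formulas with population variances, and the answer matrix is hypothetical data. For right/wrong (0/1) items the variance of an item equals p(1 − p), so the two coefficients coincide.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one row per respondent, one column per item."""
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def kr20(scores):
    """KR-20, defined only for dichotomous (0/1) items."""
    k = len(scores[0])
    n = len(scores)
    p = [sum(row[i] for row in scores) / n for i in range(k)]  # proportion correct per item
    pq_sum = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - pq_sum / total_var)

# Hypothetical right/wrong answers: 6 respondents x 4 items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

alpha = cronbach_alpha(answers)
kr = kr20(answers)  # same value as alpha for 0/1 data
```

Unlike KR-20, `cronbach_alpha` also accepts items scored on more than two levels, such as rating scales, which is the sense in which alpha generalizes the earlier formula.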

These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself, and is thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true variability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
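The yardstick remark can be illustrated numerically. This sketch assumes the classical definition of reliability as true variance divided by true-plus-error variance, and uses hypothetical standard deviations: the instrument's error is identical in both cases, and only the spread of the objects being measured changes.

```python
def reliability(true_sd, error_sd):
    """Share of observed variance due to true differences (uncorrelated errors assumed)."""
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)

ERROR_SD_CM = 0.5  # hypothetical yardstick reading error, the same for any object

houses = reliability(true_sd=300.0, error_sd=ERROR_SD_CM)  # house lengths differ by metres
insects = reliability(true_sd=0.25, error_sd=ERROR_SD_CM)  # insect lengths differ by millimetres
```

The same instrument yields a coefficient close to 1 for houses and about 0.2 for insects, because the error is tiny relative to differences between houses but large relative to differences between insects.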

Reliability may be improved by clarity of expression (for written assessments) and by lengthening the measure.

A reliability of 0.70 is widely cited as an acceptable minimum, and Nunnally is often mentioned as the primary source for this standard. However, Nunnally actually said something rather different. He made a distinction between the early stages of basic research (where he thought that a reliability of 0.70 or higher was sufficient) and the situation where important decisions were being made about candidates (where "a reliability of 0.90 is the minimum ... and a reliability of 0.95 should be considered the desirable standard", p. 246).

Nunnally's recommendation of <math>{\rho}_{xx'}=0.95</math> for important decision-making was based on a concern that the cost of false negative errors in decision-making falls disproportionately on the candidates. Given sufficient applicants, the organization just hires a different candidate or admits a different student, but the candidate suffers the loss of an opportunity. Nunnally's recommendations may therefore be cast as a recommendation for ethical decision-making via assessment.

However, Nunnally did not explicitly address the costs associated with attaining a high reliability. For example, if a 50-item test has reliability <math>{\rho}_{xx'}=0.80</math>, then the test length required for a reliability of <math>{\rho}^*_{xx'}=0.95</math> is about 238 items (and the additional items must be comparable to the existing items). Nunnally did not address how an organization would fund an exam of 238 items or how examinees would feel about sitting one. Nor did he discuss whether fatigue effects might lower candidate scores, or whether extremely long test administration windows might disadvantage some classes of candidates.
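The figure of about 238 items is consistent with the Spearman–Brown prediction formula. The text does not name the formula it used, so applying Spearman–Brown here is an assumption, as are the function names. A minimal sketch:

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def required_length_factor(current, target):
    """Factor by which a test must be lengthened to reach `target` reliability."""
    return target * (1 - current) / (current * (1 - target))

factor = required_length_factor(0.80, 0.95)  # about 4.75 times the current length
items_needed = math.ceil(50 * factor)        # 50 items * 4.75 = 237.5, rounded up
```

Plugging the factor back into `spearman_brown(0.80, factor)` recovers the target of 0.95. The prediction holds only if the added items are comparable to the existing ones, as the text notes.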

Trade-off with validity

A high value of reliability can conflict with content validity if a psychometrician removes items to maximize an estimate like coefficient alpha without regard to the content of the remaining items. Likewise, asking essentially the same question in several different ways is often done solely to increase reliability, and it does so at the expense of content validity.

Trade-off with efficiency

Other things being equal, reliability increases as the number of items increases. However, each added item makes the measurement longer to administer and therefore less efficient.
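The trade-off can be sketched by projecting reliability for progressively longer versions of a test. The projection assumes the Spearman–Brown formula and comparable added items, and the 10-item test with reliability 0.50 is a hypothetical starting point.

```python
def lengthened_reliability(reliability, length_factor):
    """Spearman-Brown projection for a test lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

BASE_ITEMS = 10          # hypothetical starting length
BASE_RELIABILITY = 0.50  # hypothetical starting reliability

# Projected reliability each time the test length is doubled.
projected = {
    items: lengthened_reliability(BASE_RELIABILITY, items / BASE_ITEMS)
    for items in (10, 20, 40, 80)
}

# Gain in reliability from each successive doubling.
lengths = sorted(projected)
gains = [projected[b] - projected[a] for a, b in zip(lengths, lengths[1:])]
```

Each doubling of test length buys a smaller gain than the one before, so the cost in testing time grows much faster than the reliability does.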

Methods to increase reliability

The following methods can be considered to increase reliability.

Before data collection:

Eliminate the ambiguity of the measurement item.
Do not measure what the respondents do not know.
Increase the number of items.
Use a scale that is known to be highly reliable.

After data collection:

Use item-analysis to identify and remove problematic items (carefully avoiding damage to content validity).
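One common item-analysis statistic is the corrected item-total correlation, which correlates each item with the total of the remaining items. The sketch below is an illustration only: the ratings are hypothetical, and the cutoff of 0.2 used to flag items is an assumed rule of thumb, not a value taken from the text.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def corrected_item_total(scores):
    """Correlate each item with the total of the *other* items."""
    k = len(scores[0])
    result = []
    for i in range(k):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # total excluding item i
        result.append(pearson(item, rest))
    return result

# Hypothetical 1-5 ratings: 6 respondents x 4 items; the last item runs against the others.
ratings = [
    [5, 4, 5, 2],
    [4, 4, 4, 1],
    [3, 3, 2, 4],
    [2, 2, 3, 3],
    [1, 2, 1, 5],
    [4, 5, 4, 2],
]

correlations = corrected_item_total(ratings)
CUTOFF = 0.2  # assumed threshold for flagging an item
flagged = [i for i, r in enumerate(correlations) if r < CUTOFF]
```

A flagged item is a candidate for review rather than automatic removal: it may be miskeyed or reverse-scored, or it may cover content the scale needs for validity.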
