Construct validity

Construct validity concerns how well a set of indicators represents or reflects a concept that is not directly measurable. Construct validation is the accumulation of evidence to support the interpretation of what a measure reflects. Modern validity theory defines construct validity as the overarching concern of validity research, subsuming all other types of validity evidence such as content validity and criterion validity.

Construct validity is the appropriateness of inferences made based on observations or measurements (often test scores), specifically whether a test can reasonably be considered to reflect the intended construct. Constructs are abstractions that are deliberately created by researchers to conceptualize the latent variable, which is correlated with scores on a given measure (although it is not directly observable). Construct validity examines the question: Does the measure behave like the theory says a measure of that construct should behave?

Construct validity is essential to the perceived overall validity of the test. Construct validity is particularly important in the social sciences, psychology, psychometrics and language studies.

Psychologists such as Samuel Messick (1998) have pushed for a unified view of construct validity "...as an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores..." Messick's views are popular in educational measurement and grew out of a career spent explaining validity in the context of the testing industry. Borsboom et al. (2004) gave a definition more in line with foundational psychological research, supported by data-driven empirical studies that emphasize statistical and causal reasoning.

Key to construct validity are the theoretical ideas behind the trait under consideration, i.e., the concepts that organize how aspects of personality, intelligence, etc. are viewed. Paul Meehl states that "The best construct is the one around which we can build the greatest number of inferences, most directly."

History

Throughout the 1940s scientists had been trying to come up with ways to validate experiments prior to publishing them. The result was a plethora of different validities (intrinsic validity, face validity, logical validity, empirical validity, etc.), which made it difficult to tell which were actually the same and which were not useful at all. Until the middle of the 1950s, there were very few universally accepted methods to validate psychological experiments, mainly because no one had figured out exactly which qualities of the experiments should be examined before publishing. Between 1950 and 1954 the APA Committee on Psychological Tests met and discussed the issues surrounding the validation of psychological experiments. They are closely related to three stages in the test construction process: constitution of the pool of items, analysis and selection of the internal structure of the pool of items, and correlation of test scores with criteria and other variables.

In the 1970s there was growing debate between theorists who began to see construct validity as the dominant model pushing towards a more unified theory of validity, and those who continued to work from multiple validity frameworks. Many psychologists and education researchers saw "predictive, concurrent, and content validities as essentially ad hoc, construct validity was the whole of validity from a scientific point of view". Under Messick's unified framework, all forms of validity are connected to and dependent on the quality of the construct. Messick noted that a unified theory was not his own idea, but rather the culmination of debate and discussion within the scientific community over the preceding decades. There are six aspects of construct validity in Messick's unified theory of construct validity:

Consequential – What are the potential risks if the scores are invalid or inappropriately interpreted? Is the test still worthwhile given the risks?
Content – Do test items appear to be measuring the construct of interest?
Substantive – Is the theoretical foundation underlying the construct of interest sound?
Structural – Do the interrelationships of dimensions measured by the test correlate with the construct of interest and test scores?
External – Does the test have convergent, discriminant, and predictive qualities?
Generalizability – Does the test generalize across different groups, settings and tasks?

How construct validity should properly be viewed is still a subject of debate among validity theorists. The core of the disagreement lies in an epistemological difference between positivist and postpositivist theorists.

Evaluation

Evaluating construct validity requires examining the correlations of the measure with variables that are known to be related to the construct purportedly measured by the instrument, or that theory predicts should be related to it. This is consistent with the multitrait-multimethod matrix (MTMM) approach to examining construct validity described in Campbell and Fiske's landmark paper (1959). A single study does not prove construct validity; rather, validation is a continuous process of evaluation, reevaluation, refinement, and development. Correlations that fit the expected pattern contribute evidence of construct validity. Construct validity is a judgment based on the accumulation of correlations from numerous studies using the instrument being evaluated.
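The pattern check behind an MTMM analysis can be sketched in a few lines of Python. The example below is only an illustration: the traits, methods, and correlation values are hypothetical, and the single rule it applies (same-trait, different-method correlations should exceed all different-trait correlations) is a simplified reading of the comparisons usually made in an MTMM matrix, not a complete analysis.

```python
from itertools import combinations

# Hypothetical measures: each one is a (trait, method) pair.
measures = [
    ("happiness", "self_report"),
    ("happiness", "peer_rating"),
    ("anxiety", "self_report"),
    ("anxiety", "peer_rating"),
]

# Hypothetical correlations between every pair of measures (illustrative only).
corr = {
    frozenset([measures[0], measures[1]]): 0.62,  # same trait, different method
    frozenset([measures[2], measures[3]]): 0.58,  # same trait, different method
    frozenset([measures[0], measures[2]]): 0.25,  # different trait, same method
    frozenset([measures[1], measures[3]]): 0.21,  # different trait, same method
    frozenset([measures[0], measures[3]]): 0.10,  # different trait and method
    frozenset([measures[1], measures[2]]): 0.08,  # different trait and method
}


def classify(a, b):
    """Label a pair of distinct measures by shared trait and shared method."""
    same_trait = a[0] == b[0]
    same_method = a[1] == b[1]
    if same_trait and not same_method:
        return "monotrait-heteromethod"
    if not same_trait and same_method:
        return "heterotrait-monomethod"
    return "heterotrait-heteromethod"


def mtmm_summary(measures, corr):
    """Group the pairwise correlations by the kind of pair they come from."""
    groups = {}
    for a, b in combinations(measures, 2):
        groups.setdefault(classify(a, b), []).append(corr[frozenset([a, b])])
    return groups


def fits_expected_pattern(groups):
    """True if every same-trait correlation exceeds every different-trait one."""
    validity = groups["monotrait-heteromethod"]
    others = groups["heterotrait-monomethod"] + groups["heterotrait-heteromethod"]
    return min(validity) > max(others)


groups = mtmm_summary(measures, corr)
pattern_ok = fits_expected_pattern(groups)
```

A result of True for pattern_ok would count as one piece of supporting evidence, to be weighed together with results from other studies rather than treated as proof.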

Most researchers attempt to test construct validity before the main research, often by means of pilot studies. Pilot studies are small-scale preliminary studies aimed at testing the feasibility of a full-scale test. They give researchers an early indication of the strength of their research design and allow them to make any necessary adjustments. Another method is the known-groups technique, which involves administering the measurement instrument to groups expected to differ due to known characteristics. Hypothesized relationship testing involves logical analysis based on theory or prior research.
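As a rough illustration of the known-groups technique, the following Python sketch compares scores from two groups that are expected to differ and reports a standardized mean difference (Cohen's d with a pooled standard deviation). The scores, the group labels, and the choice of Cohen's d as the summary statistic are assumptions made for this example; they are not prescribed by the technique itself.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled sample SD)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores on a depression inventory (illustrative data only).
diagnosed = [24, 27, 22, 30, 26, 25]   # group expected to score high
comparison = [12, 15, 10, 14, 11, 16]  # group expected to score low

d = cohens_d(diagnosed, comparison)
groups_differ_as_expected = d > 0  # direction predicted by the known characteristic
```

A large difference in the predicted direction is consistent with the instrument measuring the intended construct; a negligible or reversed difference would call that interpretation into question.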

Convergent and discriminant validity

Convergent and discriminant validity are the two subtypes of validity that make up construct validity. Convergent validity refers to the degree to which two measures of constructs that theoretically should be related are in fact related. In contrast, discriminant validity tests whether concepts or measurements that are supposed to be unrelated are, in fact, unrelated. Take, for example, a construct of general happiness. If a measure of general happiness had convergent validity, then constructs similar to happiness (satisfaction, contentment, cheerfulness, etc.) should relate positively to the measure of general happiness. If this measure has discriminant validity, then constructs that are not supposed to be related positively to general happiness (sadness, depression, despair, etc.) should not relate to the measure of general happiness. Measures can have one of the subtypes of construct validity and not the other. Using the example of general happiness, a researcher could create an inventory where there is a very high positive correlation between general happiness and contentment, but if there is also a significant positive correlation between happiness and depression, then the measure's construct validity is called into question. In that case, the test has convergent validity but not discriminant validity.
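The happiness example can be made concrete with a small Python sketch. Everything specific in it is an assumption for illustration: the scores are invented, and the cut-offs (a correlation of at least 0.5 counts as "related", below 0.3 as "not positively related") are arbitrary conventions rather than established standards.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def check_validity(target, convergent, discriminant, high=0.5, low=0.3):
    """Return (r, passed) per comparison measure; thresholds are assumptions."""
    report = {"convergent": {}, "discriminant": {}}
    for name, scores in convergent.items():
        r = pearson(target, scores)
        report["convergent"][name] = (r, r >= high)
    for name, scores in discriminant.items():
        r = pearson(target, scores)
        report["discriminant"][name] = (r, r < low)
    return report


# Hypothetical scores from six respondents (illustrative data only).
happiness = [2, 4, 5, 7, 8, 9]
contentment = [3, 4, 6, 6, 9, 9]  # expected to relate positively to happiness
depression = [3, 3, 5, 6, 6, 8]   # expected NOT to relate positively

report = check_validity(
    happiness,
    convergent={"contentment": contentment},
    discriminant={"depression": depression},
)
```

With these invented scores the inventory correlates strongly with contentment but also correlates positively with depression, so the convergent check passes and the discriminant check fails, mirroring the situation described above.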

Nomological network

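Lee Cronbach and Paul Meehl (1955) proposed the nomological network as a way of defining a construct through its relations to other constructs and to observable measures. Creating a nomological net can also make the observation and measurement of existing constructs more efficient by pinpointing errors.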

Threats to construct validity

Apparent construct validity can be misleading due to a range of problems in hypothesis formulation and experimental design.

Hypothesis guessing: If the participant knows, or guesses, the desired end-result, the participant's actions may change. An example is the Hawthorne effect: in a 1925 industrial ergonomics study conducted at the Hawthorne Works factory outside Chicago, experimenters observed that both lowering and brightening the ambient light levels improved worker productivity. They eventually determined the basis for this paradoxical result: workers who were aware of being observed worked harder no matter what the change in the environment.
Bias in experimental design (intentional or unintentional). An example of this is provided in Stephen Jay Gould's 1981 book "The Mismeasure of Man". Among the questions used around the time of World War I in the battery used to measure intelligence was "In which city do the Dodgers play?" (they were then based in Brooklyn). Recent immigrants to the US from Eastern Europe who were unfamiliar with the sport of baseball got the answer wrong, and this was used to infer that Eastern Europeans had lower intelligence. The question did not measure intelligence: it only measured how long one had lived in the US and become acculturated to a popular pastime.
Researcher expectations may be communicated unintentionally to the participants non-verbally, eliciting the desired effect. To control for this possibility, double-blind experimental designs should be used where possible. That is, the evaluator of a particular participant should be unaware of what intervention has been performed on that particular participant or should be independent of the experimenter.
Defining predicted outcome too narrowly. For instance, using only job satisfaction to measure happiness will exclude relevant information from outside the workplace.
Confounding variables (covariates): The observed effects may be due to variables that have not been considered or measured.

An in-depth exploration of the threats to construct validity is presented in Trochim.
