In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

The level of measurement is the type of data that is measured.

The word scale, including in academic literature, is sometimes used to refer to another composite measure, the index. The two concepts are, however, different.

Scale construction decisions

  • What level (level of measurement) of data is involved (nominal, ordinal, interval, or ratio)?
  • What will the results be used for?
  • What should be used: a scale, an index, or a typology?
  • What types of statistical analysis would be useful?
  • Should a comparative scale or a non-comparative scale be used?
  • How many scale divisions or categories should be used (1 to 10; 1 to 7; −3 to +3)?
  • Should there be an odd or even number of divisions? (Odd gives neutral center value; even forces respondents to take a non-neutral position.)
  • What should the nature and descriptiveness of the scale labels be?
  • What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
  • Should a response be forced or be left optional?

Scale construction method

A constructed scale should be representative of the construct that it is intended to measure. Something similar to the scale a person intends to create may already exist, so including such scales and possible dependent variables in one's survey may increase the validity of one's scale.

  1. Begin by generating at least ten items to represent each of the sub-scales. Administer the survey; the more representative and larger the sample, the more credibility one will have in the scales.
  2. Review the means and standard deviations for the items, dropping any items with skewed means or very low variance.
  3. Run an exploratory factor analysis with oblique rotation on the items for the scales. It is important to differentiate the items by their factor loadings in order to create sub-scales that represent the construct. Request factors with eigenvalues greater than 1 (to calculate the eigenvalue for each factor, square the factor loadings and sum down the columns). It is easier to group the items by targeted scales. The more distinct the other items, the better the chances that the items will load cleanly on one's own scale.
  4. “Cleanly loaded items” are those items that load at least .40 on one factor and more than .10 greater on that factor than on any others. Identify those in the factor pattern.
  5. “Cross loaded items” are those that do not meet the above criterion. These are candidates to drop.
  6. Identify factors with only a few items that do not represent clear concepts; these are “uninterpretable scales.” Also identify any factors with only one item. These factors and their items are candidates to drop.
  7. Look at the candidates to drop and the factors to be dropped. Is there anything that needs to be retained because it is critical to one's construct? For example, if a conceptually important item only cross loads on a factor to be dropped, it is good to keep it for the next round.
  8. Drop the items, and run a confirmatory factor analysis asking the program to give only the number of factors after dropping the uninterpretable and single-item ones. Go through the process again starting at Step 3. Here various test reliability measures could also be taken.
  9. Keep running through the process until one gets “clean factors” (until all factors have cleanly loaded items).
  10. Run the Alpha in the statistical program with the aim of obtaining a reliability score (internal consistency) of at least .70, and request the Alphas if each item is dropped. Any scales with insufficient Alphas should be dropped, and the process should be repeated from Step 3. Remember that Alphas are not proof of scale quality or content validity. [Coefficient alpha = (number of items)² × average correlation between different items / sum of all correlations in the correlation matrix (including the diagonal values)]
  11. Run correlation or regression statistics to check the validity of the scale. As good practice, report the final factors and all loadings for one's own scale and the similar scales selected in an appendix to the created scale.
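
Steps 3 to 5 can be sketched in code. The following is a minimal illustration in Python with NumPy, not a substitute for a statistics package. It assumes a factor pattern matrix has already been obtained (the `pattern` values are hypothetical), computes a value per factor in the way step 3 describes, and labels each item as cleanly loaded or cross loaded using the thresholds from step 4. Comparing absolute loadings is an assumption made here so that negatively loaded items are treated the same way, and the function names are illustrative only.

```python
import numpy as np


def eigenvalues_from_loadings(loadings):
    """Square each loading and sum down the columns (one value per factor)."""
    loadings = np.asarray(loadings, dtype=float)
    return (loadings ** 2).sum(axis=0)


def classify_items(loadings, min_loading=0.40, min_gap=0.10):
    """Label each item 'clean' or 'cross' using the step 4 criterion.

    An item is 'clean' if its largest loading is at least `min_loading`
    and exceeds its next-largest loading by more than `min_gap`.
    Absolute values are compared (an assumption of this sketch).
    """
    loadings = np.abs(np.asarray(loadings, dtype=float))
    labels = []
    for row in loadings:
        ranked = np.sort(row)[::-1]  # largest loading first
        primary = ranked[0]
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        is_clean = primary >= min_loading and (primary - runner_up) > min_gap
        labels.append("clean" if is_clean else "cross")
    return labels


# Hypothetical pattern matrix: 4 items (rows) by 2 factors (columns).
pattern = [
    [0.72, 0.05],
    [0.65, 0.12],
    [0.45, 0.41],  # loads on both factors, so it is a candidate to drop
    [0.10, 0.58],
]

print(eigenvalues_from_loadings(pattern))
print(classify_items(pattern))
```

Items labelled `cross` are only candidates to drop; as step 7 notes, a conceptually important item may still be retained for the next round.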

Multi-Item and Single-Item Scales

In most practical situations, multi-item scales predict outcomes more effectively than single items. Single-item measures should be used cautiously in research, and only in specific circumstances.

{| class="wikitable"
!Criterion
!Multi-item scale
!Single-item scale
|-
!Construct concreteness
|Abstract
|Concrete
|-
!Construct dimensionality/complexity
|Multidimensional, moderately complex
|Unidimensional or extremely complex
|-
!Semantic redundancy
|Low
|High
|-
!Primary role of construct
|Dependent or independent variable
|Moderator or control variable
|-
!Desired precision
|High
|Low
|-
!Monitoring changes
|Appropriate
|Problematic
|-
!Sampled population
|Homogenous
|Diverse
|-
!Sample size
|Large
|Limited
|}

Table: Criteria for Assessing the Potential Use of Single-Item Measures

  • Likert scale – Respondents are asked to indicate their degree of agreement or disagreement (from strongly agree to strongly disagree) on a five- to nine-point response scale. The response scale for a single question should not be confused with the Likert scale itself: the same format is used for multiple questions, and it is the combination of these questions that forms the Likert scale. This categorical scaling procedure can easily be extended to a magnitude estimation procedure that uses the full scale of numbers rather than verbal categories.
  • Phrase completion scales – Respondents are asked to complete a phrase on an 11-point response scale in which 0 represents the absence of the theoretical construct and 10 represents the theorized maximum amount of the construct being measured. The same basic format is used for multiple questions.
  • Semantic differential scale – Respondents are asked to rate an item on various attributes using a 7-point scale. Each attribute requires a scale with bipolar terminal labels.
  • Stapel scale – This is a unipolar ten-point rating scale. It ranges from +5 to −5 and has no neutral zero point.
  • Thurstone scale – This is a scaling technique that incorporates the intensity structure among indicators.
  • Mathematically derived scale – Researchers infer respondents’ evaluations mathematically. Two examples are multidimensional scaling and conjoint analysis.
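
As a rough illustration of two of the formats above, the sketch below combines one respondent's answers to several Likert-type questions into a single score and lists the ten response values of a Stapel scale. Summing the item responses is one common way of combining them and is an assumption here, as are the 1-to-`points` coding and the function name.

```python
def likert_scale_score(responses, points=5):
    """Combine one respondent's item responses into a composite score.

    Each response is assumed to be coded 1 (strongly disagree) up to
    `points` (strongly agree); the composite is the simple sum.
    """
    for value in responses:
        if not 1 <= value <= points:
            raise ValueError(f"response {value} is outside 1..{points}")
    return sum(responses)


# Stapel scale: +5 to -5 with no neutral zero point, giving ten values.
stapel_values = [v for v in range(5, -6, -1) if v != 0]

# Hypothetical answers to four questions on a five-point response scale.
answers = [4, 5, 3, 4]
print(likert_scale_score(answers))
print(stapel_values)
```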

Scale evaluation

Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale one has selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure.
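
Two of these reliability checks are easy to sketch numerically. The example below is a minimal illustration in Python with NumPy, using hypothetical data. It computes coefficient alpha from the item correlation matrix using the formula given in step 10 of the construction method, and it uses the Pearson correlation between two administrations as one common (assumed) way to quantify test-retest similarity. The function names are illustrative.

```python
import numpy as np


def standardized_alpha(items):
    """Coefficient alpha from the item correlation matrix.

    alpha = k^2 * (mean correlation between different items)
            / (sum of all correlations, including the diagonal)
    `items` has one row per respondent and one column per item.
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    corr = np.corrcoef(items, rowvar=False)      # k x k correlation matrix
    off_diagonal = corr[~np.eye(k, dtype=bool)]  # correlations between different items
    return (k ** 2) * off_diagonal.mean() / corr.sum()


def retest_correlation(first, second):
    """Pearson correlation between scores from two administrations."""
    return float(np.corrcoef(first, second)[0, 1])


# Hypothetical responses: 5 respondents (rows) by 3 items (columns).
responses = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 4],
    [1, 2, 1],
])

print(standardized_alpha(responses))
print(retest_correlation([10, 14, 9, 17], [11, 13, 10, 18]))
```

The same correlation can be applied to alternative forms reliability by passing scores from two different forms of the scale instead of two administrations of one form.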

Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable that are not included in the scale. Content validation (also called face validity) checks how well the scale measures what it is supposed to measure. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity: convergent validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.
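
The coefficient of reproducibility can be illustrated with a short sketch. The version below assumes binary (0/1) items that form a cumulative, Guttman-style scale, which is an assumption not stated above. It predicts each respondent's answers from their composite score and counts every mismatched cell as an error, which is one of several error-counting conventions. The data and the function name are hypothetical.

```python
import numpy as np


def coefficient_of_reproducibility(responses):
    """1 - errors / (respondents * items) for binary cumulative items.

    A respondent with composite score s is predicted to endorse the s
    most frequently endorsed items and no others; each cell that
    differs from this prediction counts as one error.
    """
    responses = np.asarray(responses, dtype=int)
    n_respondents, n_items = responses.shape
    # Order items from most to least frequently endorsed.
    order = np.argsort(-responses.sum(axis=0), kind="stable")
    ordered = responses[:, order]
    scores = ordered.sum(axis=1)
    # Ideal pattern reconstructed from the composite score alone.
    predicted = (np.arange(n_items) < scores[:, None]).astype(int)
    errors = int((ordered != predicted).sum())
    return 1.0 - errors / (n_respondents * n_items)


# Hypothetical data: 5 respondents (rows) by 3 yes/no items (columns).
answers_matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],  # deviates from the cumulative pattern
]

print(coefficient_of_reproducibility(answers_matrix))
```

A value of 1 means every individual response can be reconstructed exactly from the composite scores.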

See also

  • Rating scale
  • Level of measurement
  • Scale (analytical tool)
  • Social research
  • Marketing
  • Marketing research
  • Quantitative marketing research
  • Power law
  • Psychophysics

Further reading

  • Bradley, R. A. & Terry, M. E. (1952). Rank analysis of incomplete block designs, I. The method of paired comparisons. Biometrika, 39, 324–345.
  • Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
  • Hodge, D. R. & Gillespie, D. F. (2003). Phrase Completions: An alternative to Likert scales. Social Work Research, 27(1), 45–55.
  • Hodge, D. R. & Gillespie, D. F. (2005). Phrase Completion Scales. In K. Kempf-Leonard (Editor), Encyclopedia of Social Measurement (Vol. 3, pp. 53–62). San Diego: Academic Press.
  • Krus, D. J. & Kennedy, P. H. (1977). Normal scaling of dominance matrices: The domain-referenced model. Educational and Psychological Measurement, 37, 189–193.
  • Krus, D. J. & Ney, R. G. (1978). Convergent and discriminant validity in item analysis. Educational and Psychological Measurement, 38, 135–137.
  • Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. New York: J. Wiley.
  • Handbook of Management Scales – Multi-item metrics to be used in research, Wikibooks
