Monday 15 August 2016

7 Useful Articles on Content Validity in Relation to Measurement in Psychology and the Social Sciences





Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7(3), 238-247.

Haynes and colleagues define content validity as "the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose" (p. 238). They contend that content validation is a multi-method process, both qualitative and quantitative, and go on to outline thirteen (!) different methods that can be used to assess the adequacy of the content of a test. They argue that procedures such as these, although important parts of the scale construction process, are often overlooked by the constructors of scales as well as by their users.



Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83-117. 

Sireci traces the history of the concept of content validity. Amongst other things, he reflects on (1) how the meaning of content validity has changed over time, (2) how APA recommendations on content validity have changed over time, and (3) how some, for example Messick, have argued that the content of a test, although important, does not bear on the "validity" of a test at all. Despite this, Sireci concludes that the content validity of a psychological measure is of paramount concern. Quoting Ebel's remark that "data never substitute for good judgement", Sireci surmises that ensuring the content of a test is appropriate is a better justification of a test's use than its meeting of statistical criteria.



Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299-321.

Sireci defines content validity as the "degree to which a test measures the content domain it purports to measure" (p. 299). He discusses the strengths and weaknesses of various methods for conducting content validity studies. Among many other things, Sireci warns that the researcher must be wary of social desirability biases when conducting content validity studies with panels of experts, as well as of the bias introduced when the experts know what the instrument is supposed to measure. Sireci provides fifteen guidelines for conducting content validity studies, including selecting competent panel members, keeping the rating procedures simple, providing breaks, and providing incentives for thoughtful analysis.


Guion, R. M. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1(1), 1-10. 

Guion remarks on how, as a personnel psychologist, he was initially skeptical of the concept of content validity and of its relevance to his own work. He deduces that when people speak of content validity they are referring to the extent to which the content of the test represents the “boundaries defining the content domain”. He examines the challenges that researchers of three different aspects of human performance or behaviour (reading comprehension, assertiveness, and aggression) face if they are to create a test which is representative of their target of inquiry. He maintains that “defining a content domain of stimulus situations and recordable responses, and then developing a standardized sample from that domain” is an essential part of the measurement construction process in all three cases, but that no set of rules has been established for demonstrating the content validity of a psychological measure.


Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563-575.

Lawshe begins by remarking that, at the time of writing, there was a paucity of literature on content validity in employment testing. What’s more, no commonly accepted professional practice for demonstrating content validity had evolved, comparable to the one that had evolved for demonstrating criterion-related validity. This is a problem, he argues, because psychometricians are both under increasing legal pressure to demonstrate the validity of their assessment instruments and coming to accept that criterion-related validity is only a part of the validity equation. Lawshe aims to fill this gap in professional practice by putting forward a new procedure for content validation. This procedure consists of a content validation panel, comprised of experts, rating how essential what is measured by each item is to the target of inquiry. Lawshe introduces the content validity ratio (CVR), expressed by the equation CVR = (Ne − N/2) / (N/2), where Ne is the number of panellists rating an item “essential” and N is the total number of panellists. The CVR can be computed for each item, and the scores aggregated (Lawshe takes the mean CVR of the retained items) to provide a content validity index (CVI) for the whole scale or instrument, as the sketch below illustrates.
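
To make the arithmetic concrete, here is a minimal Python sketch of the CVR and CVI computations under Lawshe's formula; the panel size and the per-item "essential" counts are hypothetical, invented purely for illustration.

```python
def content_validity_ratio(n_essential, n_panellists):
    """Lawshe's CVR = (Ne - N/2) / (N/2), ranging from -1 to +1.
    CVR = 0 when exactly half the panel rates the item "essential"."""
    half = n_panellists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts rating 4 items; each count is the
# number of panellists who judged that item "essential".
essential_counts = [10, 8, 5, 3]
N = 10

cvrs = [content_validity_ratio(ne, N) for ne in essential_counts]
print(cvrs)  # [1.0, 0.6, 0.0, -0.4]

# The CVI is the mean CVR of the retained items (here, for simplicity,
# all four items are retained).
cvi = sum(cvrs) / len(cvrs)
print(round(cvi, 2))  # 0.3
```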


Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Thousand Oaks, CA: SAGE Publications, Inc.

Carmines and Zeller’s (1979) entry on content validity is a short one (pp. 20-22), but is extremely thought-provoking. The duo describe content validity as “the extent to which an empirical measurement reflects a specific domain of content” (p. 20). They describe a hypothetical maths test, which only includes questions on addition and ignores subtraction, multiplication and division, as prototypical of a measure with content validity problems. They argue that during the scale construction process the researcher must “sample” items from a large set, being careful not to bias the sampling process in favour of a particular area of content (see the sketch after this paragraph). When it comes to devising measures of attitudes, Carmines and Zeller argue, the scale constructor has her work cut out to achieve content validity. The difficulty in this case is that the researcher must “construct items that reflect the meaning associated with each dimension and each subdimension” (p. 21), yet the abstract theoretical concepts of the social sciences have not been described with exactness [it sounds like Carmines & Zeller read Fiske (1971) at some point]. I agree with this point: in the case of maths tests we have a very concrete target of inquiry: whether an individual can perform operations with numbers in the ways they have been taught to. They also say that with some abstract concepts it is “impossible” to sample content. I do feel that this point could have been fleshed out with examples: why is it impossible? Because there is not a large enough population of items to sample from, or for some other reason? Carmines and Zeller end by lamenting that no codes of practice or agreed-upon criteria for determining content validity exist. In the absence of these, they defer to a quote from Nunnally (1978): “Inevitably content validity rests mainly on appeals to reason regarding the adequacy with which the content has been cast in the form of test items”.
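
Their point about unbiased sampling can be made concrete with a stratified draw from the content domain. Below is a minimal Python sketch using their maths-test example; the item pools, counts, and function name are hypothetical illustrations, not a procedure Carmines and Zeller themselves propose.

```python
import random

# Hypothetical pools of candidate items for each subdomain of the
# maths content domain in Carmines and Zeller's example.
item_pool = {
    "addition":       [f"add_{i}" for i in range(50)],
    "subtraction":    [f"sub_{i}" for i in range(50)],
    "multiplication": [f"mul_{i}" for i in range(50)],
    "division":       [f"div_{i}" for i in range(50)],
}

def stratified_sample(pool, items_per_domain, seed=0):
    """Draw equally from every subdomain so that no single area of
    content (e.g. addition) dominates the resulting test."""
    rng = random.Random(seed)
    return {domain: rng.sample(items, items_per_domain)
            for domain, items in pool.items()}

test_items = stratified_sample(item_pool, items_per_domain=5)
# Every subdomain contributes exactly 5 items, avoiding the
# addition-only test that Carmines and Zeller warn against.
```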


Furr, R. M. (2011). Scale construction and psychometrics for social and personality psychology. Thousand Oaks, CA: SAGE Publications, Inc.

Furr’s entry on content validity, like Carmines and Zeller’s, is short but interesting (pp. 54-55). Furr defines content validity as “the degree to which a scale truly reflects its intended construct – no more, no less” (p. 55). Furr says that the scale constructor should be on the lookout for both construct-irrelevant content (problems with the items included in the scale) and construct underrepresentation (the omission of content which should have been included). Furr gives a global self-esteem scale that includes items pertaining to social skill as a good example of a scale containing construct-irrelevant content. On the other hand, he gives a global self-esteem scale that only includes items pertaining to academic self-esteem as a good example of a scale suffering from construct underrepresentation. Furr goes on to say that the formula for good content validity is the careful articulation of the psychological construct, plus “thoughtful and critical evaluation (and re-evaluation) of item content” (p. 54). He also argues that strong evidence of content validity is best obtained by scrutiny and approval of the scale by “experts” not involved in the scale’s construction.