Friday 26 February 2016

A Long List on Measurement in Psychology and Related Disciplines


A growing list of articles on measurement in psychology and related disciplines. Some entries on the list have short descriptions of their content (eventually all will). Although quite a dry post, I hope it may be of use to at least one person.

Measurement Models and the Characteristics of Indicators

  • Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and measurement model misspecification in marketing and consumer research. Journal of Consumer Research, 30(2), 199-218.
  • Grace, J. B., & Bollen, K. A. (2008). Representing general theoretical concepts in structural equation models: the role of composite variables. Environmental and Ecological Statistics, 15(2), 191-213.
  • Edwards, J. R., & Bagozzi, R. P. (2000). On the nature and direction of relationships between constructs and measures. Psychological Methods, 5(2), 155.
  • Bollen, K. A., & Bauldry, S. (2011). Three Cs in measurement models: Causal indicators, composite indicators, and covariates. Psychological Methods, 16(3), 265-284. [public access pdf from NCBI]
    • Bollen and Bauldry argue that researchers often assume, or reason, that responses to the elements of an assessment instrument - their measures (e.g. a collection of items on a questionnaire) - are caused by a latent variable: that they are effect indicators. This assumption, or reasoning, is in line with the theoretical rationale that underpinned the development of factor analysis. Although indicators may be effect indicators, Bollen and Bauldry argue that three other kinds of indicators exist -- causal, composite, and covariate. These alternative kinds of indicators are not best considered to be caused by a latent variable. Bollen and Bauldry urge researchers to scrutinize their indicators and to entertain different "measurement models". They outline a number of theoretical and empirical procedures for determining the nature of the relationship between the researcher's target of inquiry and his or her indicators. (A rough notational sketch of the effect/causal/composite distinction follows this list.)
  • Bollen, K. A., Lennox, R. D., & Dahly, D. L. (2009). Practical application of the vanishing tetrad test for causal indicator measurement models: An example from health‐related quality of life. Statistics in Medicine, 28(10), 1524-1536.
  • Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110(2), 305-314.
  • Bollen, K. A., & Ting, K. F. (2000). A tetrad test for causal indicators. Psychological Methods, 5(1), 3-22.
  • Bollen, K. A. (2011). Evaluating effect, composite, and causal indicators in structural equation models. MIS Quarterly, 35(2), 359-372.
  • Bollen, K. A. (1984). Multiple indicators: Internal consistency or no necessary relationship. Quality and Quantity, 18, 377-385. [pdf]
  • Guttman, L. (1971). Measurement as structural theory. Psychometrika, 36(4), 329-347.

[Figure reproduced from Bollen and Bauldry (2011); copyright APA]
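As a rough notational sketch of the effect/causal/composite distinction (my own illustration in conventional structural-equation notation, not reproduced from Bollen and Bauldry's paper): effect indicators are written as functions of the latent variable, causal indicators as determinants of it, and composite indicators as components of an error-free weighted composite.

```latex
% Effect indicators: the latent variable \eta causes the observed indicators x_i
x_i = \lambda_i \eta + \varepsilon_i, \qquad i = 1, \dots, k
% Causal indicators: the observed indicators (partly) determine the latent variable
\eta = \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_k x_k + \zeta
% Composite indicators: the indicators form a weighted composite with no disturbance term
C = w_1 x_1 + w_2 x_2 + \dots + w_k x_k
```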


Content Validity

  • Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7(3), 238-247. [pdf]
    • Haynes and colleagues define content validity as "the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose" (p. 238). They contend that content validation is a multi-method, qualitative and quantitative process, and go on to outline thirteen different procedures that can be used to assess the adequacy of the content of a test. They argue that procedures such as these, although important parts of the scale construction process, are often ignored by scale constructors and users.
  • Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83-117. [link]
    • Sireci traces the history of the concept of content validity. Amongst other things, Sireci reflects on (1) how the meaning of content validity has changed over the years, (2) how APA (American Psychological Association) recommendations on content validity have changed over the years, and (3) how some, for example Messick, have argued that the content of a test, although important, does not influence the "validity" of a test at all. Despite this, Sireci concludes that the content validity of a psychological measure is of paramount concern. Quoting Ebel, "data never substitute for good judgement", Sireci surmises that ensuring the content of a test is appropriate is a better justification of a test's use than its meeting of statistical criteria.
  • Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299-321.
    • Sireci defines content validity as the "degree to which a test measures the content domain it purports to measure" (p. 299). He discusses the strengths and weaknesses of various methods for studies of content validity. Among many other things, Sireci warns that the researcher must be wary of social desirability biases when conducting studies of content validity with panels of experts, and of the possibility that the expert is well aware of what the instrument is supposed to measure. Sireci provides fifteen guidelines for conducting content validity studies, including selecting competent panel members, making the rating procedures simple, providing breaks, and providing incentives for thoughtful analysis.
  • Guion, R. M. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1(1), 1-10. 
    • Guion remarks on how, as a personnel psychologist, he was initially skeptical of the concept of content validity and of its relevance to his own work. He deduces that when people speak of content validity they are referring to the extent to which the content of the test represents the "boundaries defining the content domain". He examines the challenges that researchers of three different aspects of human performance or behaviour - reading comprehension, assertiveness, and aggression - face if they are to create a test which is representative of their target of inquiry. He maintains that "defining a content domain of stimulus situations and recordable responses, and then developing a standardized sample from that domain" is an essential part of the measurement construction process in all three cases, but that no set of rules has thus far been established for demonstrating the content validity of a psychological measure. Despite the importance he places on content validity, Guion also shares with the reader a worry about an over-reliance on it.
  • Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563-575.
    • Lawshe begins by remarking that, at the time of writing, there is a paucity of literature on content validity in employment testing. What's more, no commonly accepted professional practice for demonstrating content validity has evolved, comparable to the one that has evolved for demonstrating criterion-related validity. This is a problem, he argues, because psychometricians are both under increasing legal pressure to demonstrate the validity of their assessment instruments and coming to accept that criterion-related validity is only a part of the validity equation. Lawshe aims to fill this gap in professional practice by putting forward a new procedure for content validation. This procedure consists of a content validation panel, composed of experts, rating how essential what is measured by each item is to the target of inquiry. Lawshe introduces the content validity ratio (CVR), expressed by the equation CVR = (ne - N/2) / (N/2), where ne is the number of panelists indicating "essential" and N is the total number of panelists. The CVR can be computed for each item and the scores aggregated to provide a content validity index (CVI) for the whole scale or instrument (a small computational sketch of this procedure follows this list).
  • Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Thousand Oaks, CA: SAGE Publications, Inc.
    • Carmines and Zeller's (1979) entry on content validity is a short one (pp. 20-22), but an extremely thought-provoking one. The duo describe content validity as "the extent to which an empirical measurement reflects a specific domain of content" (p. 20). They describe a hypothetical maths test, which only includes questions on addition and ignores subtraction, multiplication and division, as prototypical of a measure with content validity problems. They argue that during the scale construction process the researcher must "sample" items from a large set, being careful not to bias the sampling process in favour of a particular area of content. When it comes to devising measures of attitudes, Carmines and Zeller argue, the scale constructor has her work cut out to achieve content validity. The difficulty in this case is that the researcher must "construct items that reflect the meaning associated with each dimension and each subdimension" (p. 21), but the abstract theoretical concepts of the social sciences have not been described with exactness [it sounds like Carmines & Zeller read Fiske (1971) at some point]. I agree with this point: in the case of maths tests we have a very concrete target of inquiry: whether an individual can perform operations with numbers in the ways they have been taught to. They also say that it is "impossible" with some abstract concepts to sample content. I do feel that this point could have been fleshed out with examples: why not? Because there is not a large enough population to sample from? Or for some other reason? Carmines and Zeller end by lamenting that no codes of practice or agreed-upon criteria for determining content validity exist. In the absence of these things, they refer to a quote from Nunnally (1978): "Inevitably content validity rests mainly on appeals to reason regarding the adequacy with which the content has been cast in the form of test items".
  • Furr, R. M. (2011). Scale construction and psychometrics for social and personality psychology. Thousand Oaks, CA: SAGE Publications, Inc.
    • Furr's entry on content validity, like Carmines and Zeller's, is a short but interesting one (pp. 54-55). Furr defines content validity as "the degree to which a scale truly reflects its intended construct – no more, no less" (p. 55). Furr says that the scale constructor should be on the lookout for both construct-irrelevant content (problems with those items included in the scale) and construct underrepresentation (omission of that which could be included in the scale). Furr gives a global self-esteem scale which includes items pertaining to social skill as a good example of a scale which includes construct-irrelevant content. On the other hand, he gives a global self-esteem scale that only includes items pertaining to academic self-esteem as a good example of a scale which suffers from construct underrepresentation. Furr goes on to say that the formula for good content validity is the careful articulation of the psychological construct, plus "thoughtful and critical evaluation (and re-evaluation) of item content" (p. 54). He also argues that strong evidence of content validity is best obtained by scrutiny and approval of the scale by "experts" not involved in the scale's construction.
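As referenced above, a small computational sketch of Lawshe's CVR and an aggregate CVI. The function names, and the convention of taking the CVI as the mean CVR across items, are my own illustrative choices rather than prescriptions from Lawshe's paper:

```python
# Illustrative sketch of Lawshe's content validity ratio (CVR) and a scale-level
# content validity index (CVI). CVR = (n_e - N/2) / (N/2), where n_e is the number
# of panelists rating an item "essential" and N is the total number of panelists.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR for a single item; ranges from -1 (none say essential) to +1 (all do)."""
    half = n_panelists / 2
    return (n_essential - half) / half

def content_validity_index(essential_counts: list[int], n_panelists: int) -> float:
    """CVI for a scale, taken here as the mean CVR across its items."""
    cvrs = [content_validity_ratio(n, n_panelists) for n in essential_counts]
    return sum(cvrs) / len(cvrs)

# Example: ten panelists rate a four-item scale; each count is the number of
# panelists who judged that item "essential".
print(content_validity_ratio(9, 10))              # 0.8
print(content_validity_index([9, 7, 10, 6], 10))  # 0.6
```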




Scale Construction


  • Burisch, M. (1984). Approaches to personality inventory construction. American Psychologist, 39(3), 214-227. [pdf]
    • This article has an informal style that a writer could probably only get away with in American Psychologist. In places, the article is no more than a humorous send-up of the current state of scale construction, and verges on being a gross over-simplification of psychometric techniques. In other places, the article contains astute analysis and evaluation of the current state of scale construction, as well as thoughtful recommendations for the would-be scale constructor. At all points it is highly engaging.
    • Burisch argues that there are three approaches to the construction of personality measures. The external, or empirical, approach is most often employed by those who wish to sort people into known types (engineers, actors, schizophrenics). Heterogeneous pools of items are administered and items which successfully differentiate between types are retained. The inductive approach also starts with a large pool of items. Responses to many items are subject to statistical analysis or, what is surely a tongue-in-cheek descriptor, "matrix staring". Items are then grouped into scales and are given labels. The number and nature of scales follow from data analysis. In a sense, the data is the chief engineer. In contrast, the deductive approach involves the definition of the target of measurement and then the writing of items that fit this definition. Burisch contends that there is no good reason why the three approaches cannot be combined into "mixed" strategies, and hints that this might be the optimal strategy for the discerning psychometrician. In his closing remark, Burisch simply asks readers to "Faites simple!": to keep it simple.
    • Interestingly, Burisch argues that evidence that a scale is "effective" is not sufficient evidence that it is "valid": "In a world in which almost everything is at least marginally correlated with almost everything else, it is too easy to construe a modest relationship with some remote variable post hoc as evidence of construct validity" (p. 217).
    • On the content of scales, Burisch has this to say: "If we simply collect items that discriminate people rated as gregarious from others (the external strategy) or if we sample items that cluster together with "defining" items (inductive strategy), there will always be a proportion of items that correlate with the construct but do not really fit its definition. Only content considerations can separate "hard-core" items from the rest. Buss and Craik (1981) have called these "prototypical", and I believe that only items of this kind should be included in a scale." (p. 218)
  • Furr, R. M. (2011). Scale construction and psychometrics. London: Sage.
  • Curtis, R. F., & Jackson, E. F. (1962). Multiple indicators in survey research. American Journal of Sociology, 195-204.
  • Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309-319.
  • Loevinger, J. (1957). Objective tests as instruments of psychological theory: Monograph supplement 9. Psychological Reports, 3(3), 635-694.
  • Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2(1), 414-433.
    • Simms begins by claiming that self-report measures are the most commonly used kind of measures in psychology. He argues that this is because self-report methods are inexpensive, can be administered efficiently, and represent the most direct way of gaining access to people's thoughts, feelings and attitudes. The development of good self-report scales, however, takes much time and effort. Simms argues that there are three main approaches to self-report scale construction. First, the rational/theoretical approach, where "the scale developer simply writes items that appear consistent with his or her particular theoretical understanding of the target construct" (p. 415) [see Burisch's (1984) deductive approach above]. Second, the empirical, criterion-keying approach, where "items are selected for a scale based solely on their ability to discriminate between individuals from two groups of interest" (p. 415) [see Burisch's (1984) external/empirical approach]. Third, the internal consistency, or factor analytic, approach, where the aim is to "identify coherent dimensions among large numbers of items written to sample one or more candidate constructs to be measured" (p. 415). Each approach has its own problems, so the would-be scale constructor should use an integrative approach [see Burisch's (1984) mixed approach]. The beginning of Simms's article covers old ground, and is essentially a condensed form of the Burisch (1984) article. Simms's major contribution is in his description of the integrative approach, which he breaks down into three parts: (1) the substantive validity phase [including lit review, definition of construct, generation of initial item pool], (2) the structural validity phase [collect responses, analyze with statistical techniques], (3) the external validity phase [assess convergent, discriminant validity, write manual]. Extremely useful guidelines for item writing are also included.
  • Rossiter, J. R. (2002). The C-OAR-SE procedure for scale development in marketing. International Journal of Research in Marketing, 19, 305-335. 


Scale Refinement



  • Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7(3), 300-308. [psycnet link]
    • Smith and McCarthy define instrument (measure) refinement as "any set of procedures performed on an instrument designed to improve its representation of a construct" (p. 301). They claim that instrument refinement is neglected by researchers. This is problematic, they argue, because instrument refinement addresses five issues: (1) identifying the measure's structure or dimensionality, (2) ensuring the internal consistency of subscales in the event that the instrument is found to contain more dimensions than previously understood, (3) maintaining the instrument's content validity, (4) ensuring, if the scale is designed to differentiate between "normal" and "clinical" individuals, that the scale contains items extreme enough to do so, and (5) replicating the psychometric properties of the measure across independent samples (e.g. inter-item correlations, factor structures). On the issue of changes in construct definitions, Smith and McCarthy point out that "the evolution in conceptualization of a construct may require additional refinement of an existing measure" (p. 307). Thus they highlight that sometimes instrument refinement is driven by conceptual/definitional concerns, rather than statistical ones (e.g. poor internal consistency, poor predictive validity). On content validity, Smith and McCarthy have this to say: "Items reflecting unidimensional constructs should be homogeneous in content. Reliance on item statistics is insufficient for determination of content homogeneity; as Burisch (1984) has noted, only content considerations can separate items that are prototypic of a construct from items that are mere correlates of a construct" (p. 305)

Cronbach's Alpha


  • Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74(1), 107-120. [springer link full text]
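For orientation, a minimal sketch of the coefficient Sijtsma critiques, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). This is only an illustration of the conventional statistic, not anything drawn from Sijtsma's paper:

```python
# Minimal sketch of Cronbach's alpha from an items-by-respondents data matrix.

def cronbach_alpha(scores: list[list[float]]) -> float:
    """scores: one inner list of item responses per respondent (rows = respondents)."""
    n_items = len(scores[0])

    def sample_variance(values: list[float]) -> float:
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [sample_variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = sample_variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Example: four respondents answering three items on a 1-5 scale.
print(cronbach_alpha([[4, 5, 4], [2, 3, 2], [5, 4, 5], [3, 3, 4]]))
```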

On Construct Homogeneity


  • Smith, G. T., McCarthy, D. M., & Zapolski, T. C. B. (2009). On the value of homogenous constructs for construct validation, theory testing, and the description of psychopathology. Psychological Assessment, 21(3), 272-284. [pdf]

Guttman Scaling


  • Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9(2), 139-150. [jstor link]
  • Guttman, L. (1947). On Festinger's evaluation of scale analysis. Psychological Bulletin, 44(5), 451.
  • Guttman, L. (1960). Personal history of the development of scale analysis. In Levy, S. (Ed.), Louis Guttman on theory and methodology: Selected writings. (pp. 203-210). Aldershot, UK: Dartmouth.


Construct Validity Theory

  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281. [google scholar link]
  • Smith, G. T. (2005). On construct validity: Issues of method and measurement. Psychological Assessment, 4, 396-408. [pdf]
  • Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and methodology. Annual Review of Clinical Psychology, 5, 1-25. [pdf]
  • Kane, M. (2000). Current concerns in validity theory. [pdf]
  • Borsboom, D., Cramer, A. O., Kievit, R. A., Scholten, A. Z., & Franic, S. (2009). The end of construct validity. The concept of validity: Revisions, new directions, and applications, 135-170. [google books link]


Validity

  • Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061-1071.
  • Krause, M. S. (2012). Measurement validity is fundamentally a matter of definition, not correlation. Review of General Psychology, 16(4), 391-400.
  • McGrath, R. E. (2005). Conceptual complexity and construct validity. Journal of Personality Assessment, 85(2), 112-124.

Network Perspective


  • Schmittmann, V. D., Cramer, A. O., Waldorp, L. J., Epskamp, S., Kievit, R. A., & Borsboom, D. (2013). Deconstructing the construct: A network perspective on psychological phenomena. New Ideas in Psychology, 31(1), 43-53.
    • Schmittmann et al. begin by explaining that there are two common interpretations of the relationship between psychological attributes and the measures that are used to assess them. First, the reflective interpretation, which conceptualises the attribute as causing responses to an instrument. Second, the formative interpretation, which conceptualises the attribute as being caused by responses on an instrument. The classic example of the latter interpretation is socioeconomic status, which is seen as the "joint effect of variables like education, job, salary and neighborhood" (p. 1). The network perspective sees it differently. It proposes "that the variables that are typically taken to be indicators of latent variables should be taken to be autonomous causal entities in a network of dynamical systems. Instead of positing a latent variable, one assumes a network of directly related causal entities" (p. 5). Contrary to mainstream ideas in psychometrics, "In a network perspective, causes do not work on a latent variable, and effects do not spring from it" (p. 9). Schmittmann et al. argue that, if one adopts the network perspective on psychological phenomena, the focus of research may shift: "if a construct like depression is a network, searching for the common cause of its symptoms is like searching for actors inside one's television set". What's more, past psychometric concerns will become redundant: "the question whether symptoms "really measure" depression, understood causally, is probably moot" (p. 9).


Statistics in Psychometrics

  • Wilson, M. (2013). Seeking a balance between the statistical and scientific elements in psychometrics. Psychometrika, 78(2), 211-236.
  • Rossiter, J. (2005). Reminder: A horse is a horse. International Journal of Research in Marketing, 22(1), 23-25.

General Measurement Texts


  • Fiske, D. W. (1971). Measuring the concepts of personality. Oxford, England: Aldine.
  • Lemon, N. (1973). Attitudes and their measurement. London: C. Tinling & Co
  • Nunnally, J. C. (1970). Introduction to psychological measurement. New York: McGraw-Hill.
  • Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. New York, NY: Psychology Press.
  • Rust, J., & Golombok, S. (2009). Modern psychometrics: The science of psychological assessment. London: Routledge.
  • Thorndike, R. L., & Hagen, E. P. (1977). Measurement and evaluation in psychology and education (4th ed.). New York: John Wiley and Sons.
  • Wilson, M. (2005). Constructing measures: An item response modelling approach. London: Lawrence Erlbaum Associates.
  • Edwards, A. L. (1970). The measurement of personality traits by scales and inventories. New York: Holt, Rinehart, and Winston.
  • Cattell, R. B. (1946). Description and measurement of personality. London: George G. Harrap & Co.
  • Ghiselli, E. E. (1964). Theory of psychological measurement. New York: McGraw-Hill.



On the Assumptions and Philosophy of Psychometrics

  • Michell, J. (2000). Normal science, pathological science, and psychometrics. Theory and Psychology, 10 (5), 639-667. [pdf]


On Taxonomies and Dimensions

  • Meehl, P. E. (1999). Clarifications about taxometric method. Applied and Preventive Psychology, 8, 165-174. [pdf]


On Measurement of Self-Concept

  • Byrne, B. M. (1996). Measuring self-concept across the lifespan: Methodology and instrumentation for research and practice. Washington, DC: American Psychological Association.