10 Measurement

Measuring what we care about in the social world is often difficult. Attitudes, behaviors, and cultures do not easily lend themselves to being recorded succinctly as a column in a spreadsheet. Thus, a central concern with data in social science is measurement.

To distinguish what we truly care about from the things we are able to measure, we use the term construct to describe the concept or property we wish to study. By contrast, the data that ends up in our files is a variable, a term we’ve been using already throughout this book. For example, we can create a personality questionnaire to measure someone’s extraversion, but there will always be a gap (measurement error) between the values that end up in our spreadsheets (the variable) and the “true” value of the construct extraversion, a complex personality trait that is difficult to precisely quantify. For complex constructs that defy easy measurement, an operational definition describes a particular approach to practically measuring the construct. The distinction between construct and variable is particularly pronounced in psychology (where many variables of interest are difficult to measure precisely), so literature drawing on that discipline is where you are most likely to encounter this terminology. By contrast, suppose we are interested in something relatively easy to measure, like someone’s age. It is difficult to articulate a difference between the concept of age and the measured values of age, so the distinction between construct and variable is not particularly useful in this instance.

10.1 Validity and reliability

How do we evaluate whether a particular measurement approach is effective? We want measures that are valid, meaning that they (on average) reflect the underlying construct (a property known as construct validity). We also want measures to be reliable, meaning that they are precise and we get consistent results from the measurement approach.

There are many ways to evaluate validity, often identified as different types of validity. A full accounting is beyond the scope of this text, but two broadly applicable examples are worth discussing. First, face validity refers to a qualitative judgement of whether the measurement approach appears reasonable. You can always ask yourself whether a measure makes sense, based on what you know about the topic being studied. Second, criterion validity refers to a measure exhibiting associations with other variables in expected ways. When we see that a variable tracks with other variables that should be interrelated, that builds some confidence that we have not gone horribly astray in our attempts to measure a construct.

Reliability is usually evaluated by repeating measurement in some manner and then comparing how similar the results are across the different measurements. If a measure is highly reliable, the various measurements should give us similar results (unless there’s reason to believe the true value of the construct has changed between measurement attempts). Various types of reliability scores can be calculated. While the details differ, they usually have a range of either 0 to 1 or -1 to 1, with 0 or -1 indicating no reliability and 1 indicating perfect reliability (equivalent scores from the different measurement attempts). Three common methods for estimating reliability are test-retest reliability, Cronbach’s alpha, and inter-rater reliability.

Test-retest reliability involves administering a measure once and then repeating the measure, usually at a later date. In order for the test-retest approach to make sense, we generally need to be measuring a highly stable construct (at least during the period separating the two measurements). For example, personality refers to a stable set of characteristics (at least in theory), so test-retest reliability is often used to assess measures of personality. By contrast, emotional states are generally more transient, so finding that someone indicates a different emotional state at two different points in time does not indicate that the measurement approach is unreliable; the subject may simply be experiencing a different emotional state than last time they were measured.
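As a minimal sketch of the idea, test-retest reliability is often summarized as the correlation between the two administrations. The scores below are made up for illustration, and `pearson_r` is a helper written here rather than something drawn from a particular package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical extraversion scores for six respondents, measured twice.
time_1 = [12, 18, 25, 30, 22, 15]
time_2 = [14, 17, 27, 29, 20, 16]

# A value near 1 suggests respondents kept roughly the same relative standing.
test_retest = pearson_r(time_1, time_2)
print(round(test_retest, 2))
```

A high correlation here only speaks to reliability if we are willing to assume the construct itself did not change between the two measurements.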

Cronbach’s alpha can be computed when multiple indicators are combined into an index that measures the construct of interest. The classic example is a survey with several items related to one construct (as in the measure of extraversion we have repeatedly referenced) or an exam with multiple problems. Cronbach’s alpha reflects the internal consistency of the indicators used to form the index. In other words, it tells us how similar our various indicators are to one another. Cronbach’s alpha also increases, all else equal, as the number of indicators increases. So a 10-item index will have a higher Cronbach’s alpha than a 3-item index, assuming the two indices have items that are equally internally consistent. The reason for this is that as the number of indicators increases, the idiosyncrasies associated with individual items matter less to the overall index (just as larger sample sizes result in less noisy estimates). Whether this property of indices implies that we should generally use long multi-item scales to measure complex psychological or behavioral constructs is a topic of debate among survey researchers.
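To make both properties concrete, here is a small sketch using the standard variance-based formula for alpha, plus the standardized form that depends only on the number of items and their average inter-item correlation. The responses are hypothetical, and the function names are my own.

```python
from statistics import variance

def cronbach_alpha(items):
    """Alpha from raw scores; `items` holds one list of scores per indicator."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's sum
    item_variance_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

def standardized_alpha(k, mean_r):
    """Alpha implied by k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical responses: three survey items answered by five respondents.
items = [
    [4, 2, 5, 3, 1],
    [5, 2, 4, 3, 2],
    [4, 1, 5, 2, 2],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))

# Same average inter-item correlation, different numbers of items.
alpha_3_items = standardized_alpha(3, 0.3)    # about 0.56
alpha_10_items = standardized_alpha(10, 0.3)  # about 0.81
print(alpha_3_items < alpha_10_items)
```

The last comparison holds the internal consistency of the items fixed (an average correlation of 0.3) and changes only the length of the index, which is the “all else equal” scenario described above.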

Finally, inter-rater reliability can be computed when multiple sources are rating (or coding) the same material. For example, a study might rely on multiple research assistants to rate the level of charisma exhibited by a speaker, using a rubric that details specific tactics of charismatic speech that are to be counted. One can use a measure of inter-rater reliability to determine how similar the ratings are from the different research assistants. This requires that there is a sample (could be a subsample) of speeches that have each been rated by more than one person, so that direct comparisons of the scores can be made. If all raters give the same score to every speech, there will be perfect inter-rater reliability. If raters give highly inconsistent scorings of the same speech, inter-rater reliability will be low.
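Many inter-rater statistics exist. As one illustration, the sketch below computes Cohen’s kappa, which compares the observed agreement between two raters to the agreement we would expect by chance. It assumes just two raters who assign categorical codes (here, a hypothetical low/medium/high charisma rating); numeric ratings or more than two raters would call for a different statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items into categories."""
    n = len(rater_a)
    # Share of items on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, given how often each rater uses each code.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned to the same eight speeches by two assistants.
rater_a = ["high", "low", "medium", "high", "low", "medium", "high", "low"]
rater_b = ["high", "low", "medium", "medium", "low", "medium", "high", "high"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

Note that the two raters agree on six of eight speeches (75 percent), yet kappa is noticeably lower than 0.75 because some of that agreement would be expected by chance alone.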

For all types of reliability, researchers often rely on “rules of thumb” about what threshold (e.g., 0.8) a reliability score must reach to constitute “good” or “acceptable” levels of reliability. Trying to identify meaningful thresholds for the entirety of the social sciences is perhaps a hopeless task, since different constructs and types of measures allow for different realistic levels of reliability to be achieved. Within a given field, there will probably be established norms regarding acceptable levels of reliability.

Validity and reliability are both important. However, because reliability is often easier to evaluate quantitatively, you may find that more space is devoted to discussions of reliability than validity in many social science journals. Some scholars even argue that the scientific norms associated with scrutinizing reliability have led survey researchers to unjustifiably sacrifice validity in their scale development in order to achieve reliability levels that are deemed sufficient.¹

10.2 Scaling

Scaling refers to combining multiple indicators of a construct into a single variable called an index. The simplest scaling method involves taking the average (or sum) of the indicators. We call the result a summative index. While taking the average and taking the sum might seem like entirely distinct ways of creating an index, they are in some sense equivalent since each is a linear transformation of the other: divide the sum by the number of indicators, and you will have the average. Just as our results should not meaningfully change if we decide to measure something in inches instead of feet (Section 2.6), using a sum versus an average to construct an index will make no difference to our results so long as we remember to interpret our units correctly.
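The equivalence of the sum and the average can be checked directly. In this sketch (with made-up data and a hypothetical outcome variable), the two versions of the index have exactly the same correlation with the outcome, apart from floating-point rounding.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: each row holds one respondent's answers to three items.
responses = [[4, 5, 4], [2, 2, 1], [5, 4, 5], [3, 3, 2], [1, 2, 2]]
outcome = [7, 3, 9, 4, 2]  # some other variable we want to relate to the index

sum_index = [sum(row) for row in responses]
mean_index = [sum(row) / len(row) for row in responses]

r_sum = pearson_r(sum_index, outcome)
r_mean = pearson_r(mean_index, outcome)
print(abs(r_sum - r_mean) < 1e-9)
```

A regression slope would differ between the two versions by a factor of three (the number of items), which is the units issue mentioned above.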

If the indicators don’t have a common scale (or even if they do), it is often a good idea to first standardize the items before combining them into an index. Some scaling approaches will automatically do this in the background, but if you are creating a summative index you may need to make this transformation first before calculating a sum/average.
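Here is one way this might look in practice, assuming standardization means converting each indicator to z-scores (subtracting the mean and dividing by the standard deviation). The two indicators are hypothetical and deliberately measured on very different scales.

```python
from statistics import mean, stdev

def z_scores(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical indicators of extraversion on different scales.
social_hours_per_week = [10, 2, 25, 8, 0]   # hours, could range widely
likert_item = [4, 2, 5, 3, 1]               # 1 to 5 agreement scale

standardized = [z_scores(social_hours_per_week), z_scores(likert_item)]

# Average the standardized indicators for each respondent.
index = [mean(pair) for pair in zip(*standardized)]
print([round(value, 2) for value in index])
```

Without the standardization step, the hours variable would dominate the index simply because its values vary over a much larger range.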

Factor analysis refers to various methods for scaling that involve calculating different weights to apply to the various indicators. By contrast, with a summative index we are effectively applying an equal weight to all indicators, making it so that all indicators contribute equally to the final index. By assigning different weights, we make some indicators more important than others. This makes conceptual sense if we believe that some indicators are more precise or offer more unique information about the true value of the construct. Confirmatory factor analysis (as opposed to exploratory factor analysis) requires that you specify a measurement model indicating how various indicators are linked to constructs (as well as other linkages indicators may have to one another) and yields results that can be used as tests of whether the measurement model is plausible.

Principal component analysis (PCA, also called principal component factors or PCF) is a widely used technique that is often (mis)labeled a type of factor analysis and accomplishes something similar, in that it creates an index based on calculating different weights for the indicators. Unlike confirmatory factor analysis, PCA does not require the user to map out a model of measurement. The basic intuition underlying PCA is that it selects values for weights in a way that maximizes the extent to which a common (latent) factor can explain the variation in the various indicators.

The factor loadings (or weights) from factor analysis or PCA will indicate how closely aligned each indicator is to the index. There are different ways in which these values can be reported, depending on the technique and what transformations might be applied. But generally speaking, loadings closer to 0 indicate less alignment of the indicator with the index. Negative loadings mean that an indicator is negatively associated with the index (e.g., an indicator of introversion should have a negative loading for an index of extraversion).
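To give a rough sense of where such weights come from, the sketch below finds the first principal component of a small correlation matrix using power iteration. This is a bare-bones illustration with hypothetical data, not how statistical software implements PCA, and the values it prints are raw component weights; reported loadings are often rescaled versions of these.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(column):
    """Convert a list of scores to z-scores (population standard deviation)."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

def correlation_matrix(columns):
    """Correlations among indicators, one list of scores per indicator."""
    z = [standardize(col) for col in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component_weights(matrix, iterations=200):
    """Approximate the dominant eigenvector by repeated multiplication."""
    k = len(matrix)
    weights = [1.0] * k
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(k))
                   for i in range(k)]
        norm = sqrt(sum(value ** 2 for value in product))
        weights = [value / norm for value in product]
    return weights

# Hypothetical items: two tap extraversion, one taps introversion.
talkative = [5, 4, 2, 1, 3, 4]
outgoing = [4, 5, 1, 2, 3, 5]
prefers_solitude = [1, 2, 5, 4, 3, 1]

weights = first_component_weights(
    correlation_matrix([talkative, outgoing, prefers_solitude])
)
print([round(w, 2) for w in weights])
```

The first two weights come out positive and the third negative, mirroring the introversion example above. The overall sign of a component is arbitrary, so software might report all three signs flipped.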

10.3 Measurement error

Measurement error usually distorts our ability to make valid estimates. An exception is that random measurement error in a dependent variable will not necessarily violate any regression assumptions, since we can consider the measurement error to be part of the error term (so long as the measurement error conforms with the particular assumptions made about the error term). Unfortunately, when we are examining data about the social world, measurement error often extends to our independent variables as well. This raises serious concerns about the validity of our estimates, including the validity of our inferential statistical results (confidence intervals and significance tests).

If we are only examining a bivariate relationship (e.g., how X relates to Y, without any control variables), then we can at least say that random measurement error in the independent variable should lead to attenuation bias, meaning that we will tend to underestimate the strength of an association. For example, if the actual correlation between two constructs is 0.6, attenuation bias means that we will systematically tend to get estimates that are smaller than this (e.g., 0.5 or 0.4). Attenuation bias is generally considered to be one of the least disruptive types of bias since it will lead to “conservative” estimates, meaning we will at least not overstate the extent to which variables are related. By random measurement error, I mean that the value of the variable’s measurement error is unrelated to the true value of either construct (and is also unrelated to any measurement error in the other variable).
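A quick simulation can illustrate attenuation bias. The setup is an assumption made for this sketch: two constructs with a true correlation of 0.6, and an observed version of X that adds random noise with the same variance as the construct itself.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 20000

# True construct values, built so that their correlation is 0.6.
true_x = [rng.gauss(0, 1) for _ in range(n)]
true_y = [0.6 * x + 0.8 * rng.gauss(0, 1) for x in true_x]

# Observed X adds random error unrelated to either construct.
observed_x = [x + rng.gauss(0, 1) for x in true_x]

r_true = pearson_r(true_x, true_y)          # should land near 0.6
r_observed = pearson_r(observed_x, true_y)  # should land near 0.42
print(round(r_true, 2), round(r_observed, 2))
```

With this much noise, half of the variance in the observed X is error, and the expected correlation shrinks from 0.6 to about 0.6 divided by the square root of 2. Noisier measures would shrink it further.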

Unfortunately, as soon as we move to the world of multiple regression (to be covered more in Chapter 11), random measurement error in the independent variables can easily lead to inflated estimates of associations (meaning the strength of an association is overstated) or even systematically wrong-signed estimates (e.g., a negative instead of a positive association). Generally speaking, it is difficult to correctly anticipate the direction of bias that might occur from measurement error (among independent variables) in the context of multiple regression.

Correlated measurement error generates similarly disruptive problems for estimation, even when looking at bivariate relationships. Thus, this type of measurement error is generally considered to be especially problematic. Correlated measurement error refers to errors in measurement that are correlated with underlying constructs or with errors in the measurement of other variables. For example, common method variance is a frequent source of potentially correlated measurement error in survey research. Suppose that we are using a survey of employees and want to estimate the association between one’s work motivation and job performance. If we rely on self-reported survey scales (a “common method”) to measure both variables, our variables will likely exhibit correlated measurement error. Respondents who think particularly highly of themselves (or wish to convey a positive image of themselves on a survey) are likely to overstate both their own motivation and their performance. They will have high values for both variables. Respondents with a more humble disposition will tend to report lower values for both variables. Thus, measurement error will likely push the association in a positive direction (high values of one variable paired with high values in the other variable, and low values paired with low values). This can lead to an association even if none exists in the underlying constructs.
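The same kind of simulation can show how common method variance manufactures an association. Here I assume, purely for illustration, that true motivation and true performance are unrelated, and that each respondent’s self-image shifts both self-reports by the same amount.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

rng = random.Random(7)  # fixed seed so the sketch is reproducible
n = 20000

# True constructs, generated independently of one another.
motivation = [rng.gauss(0, 1) for _ in range(n)]
performance = [rng.gauss(0, 1) for _ in range(n)]

# A shared tendency to over- or under-state oneself on surveys.
self_image = [rng.gauss(0, 1) for _ in range(n)]

reported_motivation = [m + s for m, s in zip(motivation, self_image)]
reported_performance = [p + s for p, s in zip(performance, self_image)]

r_constructs = pearson_r(motivation, performance)                   # near 0
r_reported = pearson_r(reported_motivation, reported_performance)   # near 0.5
print(round(r_constructs, 2), round(r_reported, 2))
```

The self-reports end up substantially correlated even though the underlying constructs are not, because the same error term appears in both variables.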

Two main sets of tools exist for correcting for measurement error. They come out of distinct traditions of statistical analysis in the disciplines of psychology and economics. The psychology tradition has developed rather elaborate tools that utilize structural equation modeling (SEM) to estimate associations while accounting for measurement error. From the economics tradition, there is errors-in-variables regression, which allows for estimation of regression models that account for known error in the measurements of variables. Both sets of tools can be helpful for testing the sensitivity of findings to different assumptions about measurement error, but the tools are also somewhat limited in that they generally require strict assumptions about the nature of measurement error that cannot be fully tested.

10.4 Exercises

The “implicit association test” is a unique and widely used method for trying to measure cognitive tendencies. For example, this test has been used to try to study people’s subconscious biases against certain social groups (e.g., women, African Americans). Some researchers have criticized the measurement approach, arguing that the measures barely correlate with actual behavior and therefore don’t seem to reflect real personal biases or beliefs. Using terminology we learned from this chapter, what property of these measures is being criticized?
Others have criticized the “implicit association test” because subjects who take the test multiple times often have fairly different scores across the different attempts. Using terminology we learned from this chapter, what property of these measures is being criticized?
I’m conducting a survey, and I use five separate survey items to ask respondents whether they support or oppose relatively aggressive law enforcement tactics (with Likert scale response options). After collecting responses, I combine the five variables into a single variable by averaging the five responses for each individual respondent. (a) What do we call the process of creating one variable from the original five? (b) What do we call the type of index that was created?
Which is generally considered to be a more serious problem: correlated or uncorrelated measurement error?

¹ Clifton, Jeremy D. W. 2020. “Managing validity versus reliability trade-offs in scale-building decisions.” Psychological Methods 25(3): 259.