The previous essay in this series showed how straightforward descriptive
statistics could be used in conjunction with a skill profile
based on a Likert-like scale to describe levels of competence,
and how a difference in the mean of observations could indicate a
difference in competence level. As useful and widespread as this
methodology is, that essay also pointed to some of the
statistical problems that this approach has embodied within it.
They include: a presumption that the distribution of observations
of competence on each variable fall in a roughly normal
distribution about the mean value on that variable; that each
variable has equal weight in the profile; and that statistics
that are normally used with interval data can be usefully applied
to this kind of ordinal data.
This essay begins to address these concerns by transforming the variable scale into a linear probability scale that is fully amenable to the inferential statistics normally applied to interval data. The transformation uses a Rasch partial credit (polytomous) model that creates a linear interval scale; clearly shows the relative thresholds for each of the discrete items on the original scale, allowing profiling of patterns that include dichotomous or polytomous choices; and clearly shows the relative weights of each of the thresholds on a common linear scale.
Likert scales, since their definition in 1932 by Rensis Likert, have been used for respondents to indicate the extent to which they agree or disagree with a statement describing an attitude or opinion. The scale was originally designed for assessing attitudes to issues in management, like: "My job provides a lot of variety"; or, "My job provides the opportunity for independent action".
The Likert scale is essentially a bipolar scaling method, in that it seeks to gather responses across the extremes of opinion or attitude so that it covers the whole gamut from "strongly disagree" to "strongly agree". The scale typically has five points, but sometimes a nine point scale is used to provide a finer granularity in the responses. In other test designs, the middle or neutral response is eliminated, forcing the respondent to choose either to agree or disagree with the statement, and removing the possibility of remaining neutral. The data in the five or nine point scale is clearly arranged in order of the strength of the respondent's agreement with the statement, and in this respect the data is "ordinal": the order of the choice points is important, but there is no indication in the attitudes being expressed as to the relative distances between each of the points. In some analyses the data points are further collapsed into two categories, "accept" or "reject" the statement, whereby the data is reduced to nominal data.
The collapsed nominal data is useful only in as much as it indicates the degree of polarisation of opinion: it is used to show the percentage of people who agree or disagree with a proposition. A scale with purely nominal data can be used to gather pure demographic or background information, such as the state in which the respondent lives, or the class of car one drives. The analysis of these scales is usually presented in a simple bar graph showing the distribution of responses across the different classes, but other than this descriptive analysis no other inferences can be drawn about the responses.
When the scale is arranged so that it describes a subjective impression of a single variable, then the order of impression quanta on the scale is important, but the subjective judgement can have no numerical value because of the qualitative nature of the impression being captured. The variable is based on an ipsative reference: each response shows a personal judgement which may vary considerably from that of other respondents marking the same scale. The level of aversion that evokes a "strongly disagree" response from one person might be quite different from that for another person.
Petroski (2006) offers this as the "obvious reason" that ordinal scales are not as meaningful when interpreted in the simplistically descriptive way of tradition. He explains thus:
The "obvious reason" for me is that ordinal scales do not have meaningful units of measurement. It is true that we are all comfortable with reporting "an mean change of k points" and tend to forget that "point" is meaningless. In my consulting role I have often asked a client "what kind of change is clinically meaningful?". This question most often comes up in sample size planning. I would guess that 4 out of 5 times even the users of ordinal scaled "measures" can not articulate what a meaningful change is. At some point it is big enough to care about, and at the other extreme it is small enough to be irrelevant, and there is a huge grey area in between. (Petroski G. 2006, Comment on the Rasch User Group mail list)
As with nominal data, the ordinal data can be described by a bar graph showing the distribution of probabilities, which will show if there is a central tendency (perhaps best described by a median rather than a mean), or whether there is polarisation, skew, or a uniform distribution representing an inconsistency in the choices. The degree of uniformity (or variation) across the distribution can be described by the range of the data, or the inter-quartile range, rather than the standard deviation which is reserved for interval data.
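As a minimal sketch of this descriptive treatment (the responses here are invented purely for illustration), the median and inter-quartile range of an ordinal data set can be computed directly with Python's standard library:

```python
from statistics import median, quantiles

# Hypothetical responses on a single 1-5 Likert item
# (1 = strongly disagree ... 5 = strongly agree).
responses = [2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 1, 2]

med = median(responses)                  # central tendency for ordinal data
q1, q2, q3 = quantiles(responses, n=4)   # quartile cut points
iqr = q3 - q1                            # spread: inter-quartile range
```

Note that no mean or standard deviation appears here: the quartiles respect only the ordering of the categories, not any assumed distances between them.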
When the range of opinions is mildly divergent from neutral, the observations of opinion or attitude will tend to cluster about the middle of the range of opinion. However, when opinion or attitude is polarised a bimodal distribution occurs, making any measure of central tendency meaningless.
When trying to compare opinions between two classes of respondents, the exercise is made more difficult if the range, inter-quartile range, or variance differs between the two classes. If the data is polarised or heavily skewed, this difficulty is made worse, since the variance shrinks around the ends of the scale: there is no alternative response to one side of the end point. Thus, the variance is parabolic, as with the binomial (Sclove, 2001), once again indicating that statistics that assume interval data, like the mean and standard deviation, are not appropriately applied to ordinal scales such as the Likert scale.
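The parabolic variance can be shown in a couple of lines (the proportions here are illustrative): for a collapsed agree/disagree item where a proportion p of respondents agree, the binomial variance p(1 − p) vanishes at the ends of the scale and peaks in the middle.

```python
# Variance of a bounded (collapsed agree/disagree) response: p * (1 - p)
# is parabolic, zero at the scale ends and maximal at p = 0.5.
variances = {p / 10: (p / 10) * (1 - p / 10) for p in range(11)}
```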
Because the judgements people exercise in responding to these scales are impacted by personal subjectivity there are biases that may intrude. "Respondents may avoid using extreme response categories (central tendency bias); agree with statements as presented (acquiescence response bias); or try to portray themselves or their group in a more favourable light (social desirability bias)." (Wikipedia, 2006)
Finally we come to the problem of the assumption that each variable scale has equal weight.
Bond & Fox (2001, pp. 66-68) use an example to illustrate the problem. Consider two questions that attempt to assess the respondent's aversion to computers. One could ask the respondent to express a subjective opinion on their agreement or otherwise to the statement, "I'm afraid that I will make mistakes when I use my computer". This question elicits a subjective response based on an aversion to computer use that does not prohibit using the computer. On the other hand, the question, "I'm so afraid of computers that I avoid using them" is eliciting a response to a much higher anxiety level about using computers: one that can indicate a phobic response to computers and a prohibition on computer use as a result. While the scales for the subjective response are the same, clearly the difficulty of agreeing or disagreeing with the statement is quite different. Further, the step size between scale points is likely to be quite different as well. Simply adding or averaging the responses to these questions to find a single figure that represents aversion to computer use will give a crude indication, but it is not a fair representation of the level of anxiety.
[Figure: the two computer-anxiety statements placed on a scale running from "Less Anxious" to "More Anxious"]
This example illustrates how respondents on a subjective Likert scale can have different subjective responses to statements describing variables contributing to the trait of interest, in this case the level of anxiety about computer use. In this example, it is harder to agree that the anxiety is such that one would avoid using a computer altogether, although some respondents may indeed have that response.
In another example, drawn from a student satisfaction survey, it seems that question 22 was harder to endorse than question 8. It also shows that the range of responses in question 15 was larger than that of either of the other questions. Further, there is an indication in these data that the step size to "strongly disagree" in questions 8 and 15, and the step to "strongly agree" in question 15, are larger than the other steps in the scale.
[Figure: survey questions placed on a scale running from "Less Satisfied" to "More Satisfied"]
Likert resorted to an artifice to linearise his scales. First he made several assumptions about the nature of the data:
- The level of agreement with specific statements could be used as a measure of one's attitude or opinion about the subject of the statements;
- Each statement is an independent item describing the trait of interest;
- Each item was an equal measure of the trait of interest;
- The questions and responses could be worded in such a way as to evoke a normally distributed pattern of responses, and the extreme alternatives were such that they would attract responses from only those with the most extreme attitudes or opinions.
By asking several questions on different subjects related to the attitude or opinion of interest, Likert reasoned that he could quantitatively measure a person's attitude or opinion.
Likert assumed that the rating response to each item represented a separate measure of attitude. For a given population, Likert further assumed that an attitude… would be normally distributed. For each item, the mean of the attitude distribution for the population could be shifted from the overall mean, depending on sampling biases and confounding variables specific to the item. However, it was expected that the average distribution across all items would reflect the population's distribution of attitude. (Massof, 2001, p. 522.)
The response categories are ordered for each item, from the most negative to the most positive response to the item. This ordinal data then needed to be linearised, and Likert achieved this by transforming the response categories to normal standard deviation units (z scores or sigma units). Based on the assumption of a normally distributed pattern of responses, the position of each of the category thresholds can be estimated from the probability of that category being selected, itself estimated from the frequency with which subjects selected that category. Likert positioned each category at the mid-point of the interval between its thresholds. Because the first and last categories had only one boundary each, Likert arbitrarily assumed an upper and lower boundary at 99% and 1% probability respectively, to position the first and last categories.
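Likert's sigma-unit transformation can be sketched as follows. The category counts here are hypothetical, and the 1% and 99% outer boundaries are the artifice described above; the inverse normal CDF converts cumulative proportions to sigma (z) units.

```python
from statistics import NormalDist

# Hypothetical counts for one five-category Likert item
# (strongly disagree ... strongly agree), illustrative only.
counts = [10, 25, 40, 20, 5]
total = sum(counts)
z = NormalDist().inv_cdf  # standard normal inverse CDF

# Cumulative proportion at the upper boundary of each category.
cum = []
running = 0
for c in counts:
    running += c
    cum.append(running / total)

# Category boundaries in sigma units; the outermost boundaries are
# arbitrarily fixed at 1% and 99%, as in Likert's procedure.
bounds = [0.01] + cum[:-1] + [0.99]
sigma_bounds = [z(p) for p in bounds]

# Each category is positioned at the midpoint of its interval.
sigma_values = [(a + b) / 2 for a, b in zip(sigma_bounds, sigma_bounds[1:])]
```

The sigma values inherit the category ordering, which is what let Likert argue they could stand in for the ordinal ranks.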
Linearised in this way, Likert was able to argue that the sigma values, which were on an interval scale, were equivalent to the ordinal ranks of the response categories.
Therefore, Likert concluded, averaging response ranks is equivalent to averaging sigma values, and for all practical purposes, the average of response ranks across items produces a score that is equivalent to a measure on an interval scale. (Massof, 2001, p.523)
Massof (2001, Figures 9 & 10, p. 524) shows that while the individual sigma scores for each item vary (in what appears to be a roughly normal distribution), the average of the sigma scores for the centre three decision points has a relatively linear relationship with the response category. The assumption is made, therefore, that the linear relationship continues into the extreme response categories, but since these are unbounded this assumption is invalid. In fact, the relationship is more like an "S" curve, with a linear portion in the centre. Massof (2001, p. 523) shows that had Likert chosen 99.9% and 0.1% as the outside boundaries then the assumed linearity would not have held. Likert's justification for averaging response ranks has nevertheless been the basis of very common practice with Likert scales for over 70 years. But the justification was not correct.
George Rasch, in the 1950s, showed that dichotomously scored responses could be transformed such that the probability of a correct response to an item is an exponential function of the person ability, βn, and the item difficulty, δi. Masters (1982) extended Rasch's model to include polytomously scored responses. This model shows that the probability of achieving a level, having already achieved the previous level in an ordered sequence, is similarly a function of the person's ability and the item threshold difficulty. The Masters partial credit model is shown below in Figure 1. This model gives the probability, π, of person n scoring x on item i.
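In symbols, the Masters (1982) partial credit model is conventionally written, for an item i with m_i steps, as:

```latex
\pi_{nix} =
  \frac{\exp\!\left(\sum_{j=0}^{x} (\beta_n - \delta_{ij})\right)}
       {\sum_{k=0}^{m_i} \exp\!\left(\sum_{j=0}^{k} (\beta_n - \delta_{ij})\right)},
\qquad x = 0, 1, \ldots, m_i,
```

with the convention that the empty sum for x = 0 is zero, so the score-0 term in the denominator is exp(0) = 1 and the category probabilities sum to one.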
One interesting aspect of this model is that the person "ability" (or strength of opinion) variable can be determined independently of the specific items used in the test or rating scale. Similarly, the item difficulty (or strength of agreement) for each item is independent of the person abilities used to determine it.
Masters (1982, p.161. Figure 5) illustrates the operation of this model in the figure reproduced here as Figure 2. The top part of the figure shows the cumulative probability that a respondent might complete the first step (that is from no selection to category 1) in the leftmost curve (ogive or "S" shaped curve) having a slope at the centre of 1 and a threshold difficulty of δi1 logits. Similarly, the rightmost curve, representing a more difficult choice, has a higher difficulty threshold of δi2 logits.
The lower part of Figure 2 shows the probability of selecting each category, with the category changes occurring at the intersections of the category probability curves for each item. It is these probability distributions that can be derived from the distributions of responses by performing a transformation based on the partial credit model, usually through a computer programme (there are several suitable programmes) that performs the complex iterative numerical calculation to determine the position of each category of each item along a linear "logit" scale.
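A minimal sketch of the category probabilities under the partial credit model (step difficulties hypothetical; this is the standard Masters formulation, not any particular programme's implementation): adjacent category curves cross exactly where the person ability equals the step threshold.

```python
from math import exp

def pcm_probs(beta, deltas):
    """Masters partial credit model: probability of each score category
    0..m for a person of ability beta (logits) on an item with step
    difficulties deltas = [delta_1, ..., delta_m] (logits)."""
    # Numerators are exp of the cumulative sums of (beta - delta_j);
    # the empty sum for category 0 gives exp(0) = 1.
    numerators = [1.0]
    s = 0.0
    for d in deltas:
        s += beta - d
        numerators.append(exp(s))
    denom = sum(numerators)
    return [n / denom for n in numerators]

# Hypothetical two-step item: at beta = delta_1, categories 0 and 1
# are equally likely -- the curves intersect at the threshold.
p = pcm_probs(-1.0, [-1.0, 1.5])
```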
The probability distributions shown in Figure 2 are hypothetical and are used to illustrate the derivation of the partial credit model in the Masters 1982 paper. In practice some steps are harder to achieve than others; sometimes several steps are at a similar difficulty or are not well ordered in difficulty (although a simplifying assumption in Likert-like attitude scales is that the responses are well ordered); sometimes the probability ogives are steeper than the hypothetical mid-slope of 1, indicating a rapid transition between likely states as ability increases, or are shallower, indicating the intrusion of chance or other factors unrelated to the trait of interest into the choices or achievements. These considerations come into play when interpreting the transformed data for the reliability of measurement, fit to the model, and the validity of individual items in a test or rating scale. The model can be further simplified in the case of a rating scale, given the assumption of roughly equal step size within each item, to reduce the output to a difficulty value for the item rather than for each of the individual steps. These issues will be explored in more detail in a later essay in this series.
Clearly, the cumulative probability distribution, or ogive, for each step in the item in this model is not, in itself, a direct linear relationship. However, the transformation does place the item difficulty, δ, and the person ability, β, independently on a linear logit (or log odds unit) scale. This scale is a true interval scale that needs no artifice relating to category boundaries, as the Likert analysis does. As a computational convenience, the average value of logits for items in a particular data set is set to zero (0). A later essay in this series will address the issue of calibration of the logit scale against known and repeatable references to create a truly calibrated measurement of ability.
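The zero-centring convenience amounts to a one-line adjustment (the difficulty values below are hypothetical): only differences in logits carry meaning, so the scale is anchored by making the item difficulties average to zero.

```python
# Hypothetical item difficulty estimates in logits.
difficulties = [0.8, -0.3, 1.2, -1.5]

# Anchor the scale: subtract the mean so the item difficulties
# average to zero, the usual computational convenience.
mean_d = sum(difficulties) / len(difficulties)
centered = [d - mean_d for d in difficulties]
```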
Some years ago I analysed the data from a satisfaction survey based on a Likert scale with 26 questions. Most of the questions were derived from the Course Experience Questionnaire designed by McInnis, Griffin, James & Coates (2001).
The data was analysed using the Quest computer programme to produce the output below.
In all there were 225 respondents to the survey and 26 questions. Six of the questions were specific to the particular survey and the remaining 20 were questions that had been developed for the Course Experience Questionnaire.
The logit, or log odds, scale is shown at the far left of the graph in Figure 3, and ranges from -3 to +5. Next, represented by the "X"s, is the distribution of the overall satisfaction level indicated for each respondent. On the right hand side of the figure is the distribution of the difficulty that respondents had in choosing a particular level, numbered from 1 = "Strongly Disagree" to 5 = "Strongly Agree". In this case, level 3 is the neutral response. In other words, the presentation of the survey is in the traditional format of a Likert-like questionnaire with 5 steps.
For the item responses, the number to the left of the point indicates the item number: the number to the right of the point is the item level. The indicated step level is the threshold for that step where the likelihood that the step will be chosen changes from one level to the next. The graph in Figure 3 shows that the likelihood of a particular step being chosen is distributed roughly normally about a mean of the item thresholds that could be interpreted as the difficulty of choosing a level of satisfaction (the trait of interest). In this case a mean of the item difficulties at a particular level is the appropriate statistic since the distribution of item difficulties for that step is roughly normally distributed, and the scale is an interval scale. In comparing the mean step difficulties with those from the original pilot reported by McInnis, Griffin, James & Coates (2001) it appears that the mean step difficulties in this survey are very close to those in the pilot survey (the pilot was scored from 0 to 4 instead of from 1 to 5).
If the step distributions are compared with the curves in Figure 2, the δij thresholds correspond to those marked on the graph in Figure 3. The key threshold is that between level 2 and level 3; that is, between the choice of dissatisfied and neutral, where the opinion changes from a negative one to a positive one. Clearly, only four of the 225 respondents indicated any level of overall dissatisfaction as measured by this questionnaire.
Figure 3. Satisfaction Survey: Item Estimates (Thresholds), all on all (N = 225, L = 26), Quest output. [Variable map on a logit scale from -3.0 to +5.0: each X on the left represents one respondent's overall satisfaction; item.step thresholds (item number to the left of the point, step level to the right, e.g. 13.5, 8.3) are listed on the right; the Step 4-5, Step 3-4 and Step 2-3 (Dissatisfied) boundaries are marked.]
This example illustrates that, on a linear logit scale brought about by using the Rasch transformation of the initial responses:
- The scale range and item step sizes for each item are different;
- Over the whole questionnaire, the distribution of step difficulties at a particular level is apparently normally distributed on a linear logit scale, so a mean of the step difficulties will give a valid measure of the step difficulty of the test as a whole (compare the data in Table 2 with that in Figure 3);
- The difficulty level where step distribution curves cross can give an indication of the achievement, attitude or opinion on a linear scale that can be compared with a phenomenographical analysis of the activities required to achieve (or select) that level on the questionnaire; and
- The person attitude distribution is a representation of the attitudes of the respondents plotted on a linear scale alongside markers that indicate the transitions between negative, neutral and positive attitudes on the trait of interest.
This is just one data set, and a case could still be made that the data is being inappropriately analysed by using this transformation: after all, it is a complex transformation process, using a complex model, and I have made exactly this case for previous attempts to acquire a measure of a trait using profiles or Likert scales. Interestingly, using many of the same questions in the pilot, but a data sample quite removed from that used in this example, the authors of the Course Experience Questionnaire show mean step difficulties nearly identical (or at least within the scope of experimental error) to those calculated for the step difficulties in this sample. At least this points to reliability of the ruler to produce consistent results on different samples from a much larger population (one measure of measurement reliability). But we have further analysis before we can be more confident that the observation is more than just a chance occurrence.
As this scale stands, the choice of the zero mark on the logit scale is conveniently chosen for the data set: it is not referenced specifically to any external calibration – yet. We do have a means, with this Rasch approach, of developing a ruler on a linear scale that meets the conditions for conjoint measurement. (We still have to test for this condition being met, but more of that later.) At this point, though, we still have a way to go before we can actually call this set of observations a measurement of the trait of interest. To make this set of observations into a measure of a trait we need to determine some absolute reference that is universally available, and we need to calibrate the logit (or log odds) scale against this absolute reference in some agreed way, thus changing the interval scale into a ratio scale.
One way of setting the reference is to analyse the tasks required, typical opinions, or attitudes to achieve a particular level on the survey questionnaire (or test). Traditionally this has been done by using experienced people (the examiner, or a panel of experts in the trait being examined) to consider how a novice and an expert might approach the solution to a problem. The previous essay in this series addressed how this might be approached. Alternatively, one could measure the trait in a sample of the population covering the gamut from beginner, through novice and journeyman to expert, and see where each of these groups fits on the logit scale.
Before we can really take this calibration process much further though, there is still some work to do to check that the observations using the tools explored so far are both reliable and valid. That is the subject of the next essay in this series.
Bond, T. G. & Fox, C. M. (2001) Applying the Rasch Model. New Jersey, Lawrence Erlbaum Associates Inc.
Massof, R. W. (2001) The measurement of vision disability. Optometry and Vision Science 79(8) 516-52.
Masters, G. N. (1982) A Rasch Model for Partial Credit Scoring. Psychometrika, 47(2), 149-174.
McInnis, Griffin, James & Coates (2001) Development of the Course Experience Questionnaire (CEQ). Canberra, DETYA. Accessed at: ceds.vu.edu.au/set/pdf/CEQ%20SES%20and%20SET.pdf on 1 September 2005.
Sclove, S. L. (2001) Notes on Likert Scales. Accessed at: http://www.uic.edu/classes/idsc/ids270sls/likert.htm on 22 April 2006.