Analyses

Statistical analyses were conducted using SPSS 14 for Windows.

Although this study focused on PD, the lack of systematic differences between people with PD and age-matched controls, as well as across other health-related respondent characteristics, suggests that our findings are relevant beyond this context.

The identified best categories for three-, four-, five- and six-category response scales were not optimal, as they failed to fulfill the assumption of equal inter-category distances, even when their 95% CIs were taken into account. For example, the distances between "Some of the time" and "A good bit of the time" are clearly different from those between "A good bit of the time" and "Most of the time". [...] 9 and 74.7, respectively. That is, the estimated distance between the latter two categories is about twice as large as that between the former two. Similar or more extreme situations are evident with scales such as the PFS-16, FACIT-F, SF-36, PDQL, and PDQUALIF. Conceivably, this has at least two consequences. First, it may contribute to respondent difficulties in using the response options. Second, it is unknown what a given difference in raw rating scale scores represents, or how much more someone has changed compared with people with smaller change scores. This illustrates the ordinal nature of raw rating scale data and argues against the legitimacy of analyzing and interpreting summed integral numerals from item responses as linear measures. This latter aspect represents a fact that is perhaps partly overlooked when developing rating scales, namely the profound step that is taken when words are transformed into numbers that are then typically treated as linear measures.

A number of aspects need to be taken into consideration when interpreting the results presented here. The samples studied were not randomly selected, which may limit the generalizability of the results.
Furthermore, the sample sizes were somewhat limited, which reduces the precision of observations and therefore renders the reported 95% CIs wider than they otherwise would have been. However, given that the data failed to support the assumption of equal inter-category distances even when the observed CIs were taken into account, increasing the number of observations would presumably have yielded even stronger evidence against legitimate raw score summation of the response categories studied here. Similarly, the lack of differences between people with PD and control subjects, as well as between other subgroups, also needs to be interpreted in view of the sample size. That is, with increasing numbers of observations, statistically significant differences are increasingly likely to be detected. However, statistical significance says nothing about the practical significance of differences, which is not known for the current type of data.

The variability in interpretations of response categories was wide between individuals. This does not appear to be limited to patient-reported data, as studies of physicians' interpretation of various probability-related expressions have shown similar variability. This variability further complicates score interpretation at the individual patient level. An important aspect in this respect is the extent to which interpretations are stable within individuals over time. This needs to be assessed in further studies designed for this purpose.
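
The earlier point about limited sample sizes and wider 95% CIs can be illustrated with a minimal sketch. It is not the analysis performed in this study (which used SPSS); it assumes a normal approximation for the CI of a mean, and the standard deviation and sample sizes below are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd: float, n: int, level: float = 0.95) -> float:
    """Half-width of a normal-approximation CI for a mean.

    sd    -- standard deviation of the observations (hypothetical here)
    n     -- number of observations
    level -- confidence level, e.g. 0.95 for a 95% CI
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * sd / sqrt(n)


# Hypothetical spread of category ratings; not a value from the study.
ASSUMED_SD = 20.0

for n in (25, 50, 100, 200, 400):
    half_width = ci_half_width(ASSUMED_SD, n)
    print(f"n = {n:3d}: 95% CI = mean +/- {half_width:.2f}")
```

Under these assumptions the half-width shrinks in proportion to one over the square root of n, so quadrupling the number of observations halves the interval. This is consistent with the argument above: narrower CIs would make unequal inter-category distances easier, not harder, to demonstrate.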