Background

Lumbar spinal stenosis (LSS) is defined by diminished space for the neural and vascular elements in the central canal of the lumbar spine secondary to degenerative changes of the facet joints, ligaments, vertebrae, and intervertebral discs1. Symptoms of neurogenic claudication include pain in the buttocks and lower extremities provoked by walking or prolonged standing1. About 60–80% of people experience low back pain at some time in their lives2, and LSS is one of the most prevalent low back pain conditions. Treatment options range from non-surgical approaches, such as medication, electrophysical agents, manual therapy, general exercises and spinal stabilization exercises3, to surgery.

The Zurich Claudication Questionnaire (ZCQ), also known as the Swiss Spinal Stenosis Measure or the Brigham Spinal Stenosis Questionnaire, was developed in 1996 by Gerald Stucki et al.4. It is an 18-item, self-administered, disease-specific questionnaire that consists of three domains: symptom severity (SS), physical function (PF), and satisfaction (SAT)4. The SAT domain of the original ZCQ is evaluated only after surgery to gauge how satisfied patients are with their care; it is not used to gauge how therapy has changed outcomes. Therefore, several trials have used only the SS and PF domains of the ZCQ for outcome evaluation, particularly for non-surgical therapies5.

For patients with LSS, the ZCQ has been shown to be a reliable and precise disease-specific questionnaire that has been translated and validated in several languages6,7,8,9. The Chinese national clinical practice guidelines for LSS10 and the core outcome sets for LSS clinical trials11 propose the use of the Chinese version of the Zurich Claudication Questionnaire (Ch-ZCQ), which has been available in a culturally adapted Chinese form6 since 2014. Despite this, an assessment using the Consensus-based Standards for the Selection of Health Status Measurement Instruments (COSMIN) checklist found that the low quality of the evidence made the ZCQ difficult to use in LSS trials12. Moreover, there have been few attempts to confirm the validity of the ZCQ in the Chinese population, particularly with regard to responsiveness or the minimal clinically important difference (MCID) for the SS and PF domains.

Thus, the objective of this study was to evaluate the Ch-ZCQ in LSS patients receiving non-surgical therapy. The COSMIN checklist was used to assess the Ch-ZCQ’s measurement properties, which included responsiveness, validity, reliability, MCID, and ceiling/floor effects.

Methods

A single-center validation study was performed. The COSMIN methodology checklist13 was used to validate the Ch-ZCQ.

The study was approved by the Ethics Committee of Dongzhimen Hospital of Beijing University of Chinese Medicine, approval No.: DZMEC-KY2017-128.

Patients

Patients diagnosed with LSS14, between 50 and 85 years of age, and receiving 3 months of non-surgical treatment (acupuncture, cupping therapy, tuina, epidural steroid injection, interference electrical therapy, hot compress, and oral analgesics) were included after providing informed consent. Patients were excluded if they had severe cauda equina syndrome, spinal fracture, lumbar tuberculosis, spinal tumors, rheumatoid arthritis, severe hematopoietic, cardiovascular, or endocrine system diseases, cancer, severe anxiety, severe depression, postoperative spine pain, or vascular intermittent claudication, or if they were menstruating or lactating women. The study was conducted from 2021 to 2022 in the inpatient and outpatient departments.

The sample size was first set at five times the number of Ch-ZCQ items (12 items, i.e., 60 patients), which gave a requirement of at least 75 LSS patients after accounting for potential dropouts. In addition, confirmatory factor analysis (CFA) requires a minimum of 100 patients15. Taking both criteria into account, we required a minimum of 100 patients.

Measurements

The Ch-ZCQ6, Oswestry Disability Index (ODI)16 and 12-item Short Form Health Survey Version 2 (SF-12v2)17 were administered before treatment, and 1.5-3 months after treatment.

The Ch-ZCQ consists of the Symptom Severity (SS) and Physical Function (PF) domains, with 12 items in total. All responses are reported on a Likert-type scale18: the 7 SS items are scored from 1 to 5, and the 5 PF items are scored from 1 to 4. Higher scores indicate more severe LSS.

The ODI contains 10 items, each scored on a 6-point scale (0–5). Item 8 (sexual life) was omitted from the ODI in this study. The ODI score is calculated as (total score / (5 × number of questions answered)) × 100. A higher score indicates more functional limitation due to low back problems19. The Chinese version of the ODI was previously confirmed to be reliable and valid16.
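
The ODI formula above can be illustrated with a minimal Python sketch. The function name `odi_score` and the use of `None` to mark an unanswered or omitted item are assumptions made for illustration; they are not taken from the study.

```python
from typing import Optional, Sequence


def odi_score(item_scores: Sequence[Optional[int]]) -> float:
    """Return the ODI percentage: total / (5 * items answered) * 100.

    Each item is scored 0-5. None marks an unanswered or omitted item
    (an illustrative convention, not part of the study protocol).
    """
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("at least one item must be answered")
    if any(s < 0 or s > 5 for s in answered):
        raise ValueError("item scores must lie between 0 and 5")
    return sum(answered) / (5 * len(answered)) * 100


# Hypothetical respondent: nine answered items, item 8 (sexual life) omitted.
example = [2, 3, 1, 2, 4, 3, 2, None, 1, 2]
print(f"ODI = {odi_score(example):.1f}%")
```

Because the denominator uses the number of questions answered, omitting item 8 rescales the maximum from 50 to 45 points rather than lowering the achievable percentage.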

The SF-12v2 consists of 12 items and evaluates 8 dimensions of health-related quality of life: General Health (GH), Physical Functioning (PF), Role-Physical (RP), Bodily Pain (BP), Vitality (VT), Social Functioning (SF), Role-Emotional (RE), and Mental Health (MH). The Physical Component Summary (PCS) was calculated from GH, PF, RP and BP, and the Mental Component Summary (MCS) was calculated from SF, RE, MH and VT. This study used a Chinese version of the SF-12v2, which has shown good reliability and validity20.

The change in score on SF-12v2 item 1 (“Overall, what do you think of your current health?”), which reflects the change in overall health status before and after treatment, was used to evaluate responsiveness. The response options were “excellent,” “very good,” “good,” “moderate” and “poor”.

Measurement properties and statistical analysis

The analysis of the measurement properties covered reliability (internal consistency and reproducibility), content validity, construct validity, discriminant validity, and structural validity, as well as the responsiveness and interpretability of the scale, following the COSMIN guidelines21. Additionally, the quality of the ZCQ was evaluated using the current updated criteria for good measurement properties22.

Descriptive statistics were performed on all baseline scores and follow-up scores. The domain scores, total scores, and mean values of ZCQ, SF-12v2 and ODI scores were calculated.

For test-retest reliability, the level of agreement between the two time points was evaluated using the intra-class correlation coefficient (ICC). Patients completed the questionnaire before admission and 1.5-3 months after treatment in the hospital. For the calculation, we selected patients whose scores on item 1 of the SF-12v2 were unchanged following therapy. The coefficient ranges from 0 to 1, and a coefficient of 0.7 or greater was considered sufficient test-retest reliability23.
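
As an illustration of the ICC calculation, the sketch below computes ICC(2,1), the two-way random-effects, absolute-agreement, single-measurement form. The source does not state which ICC form was used, so this choice, the function name `icc_2_1`, and the example scores are all assumptions.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `scores` holds one row per patient and one column per time point.
    """
    n = len(scores)      # number of patients
    k = len(scores[0])   # number of time points
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical SS domain scores for five stable patients: (baseline, follow-up).
pairs = [(2.9, 3.0), (3.4, 3.3), (2.1, 2.3), (4.0, 3.9), (2.6, 2.6)]
icc = icc_2_1(pairs)
print(f"ICC(2,1) = {icc:.3f}; meets the 0.7 threshold: {icc >= 0.7}")
```

An absolute-agreement form penalizes a systematic shift between the two time points, which matters here because patients were treated during the retest interval.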

For internal consistency, the homogeneity of the items within each domain was evaluated by calculating Cronbach’s alpha coefficients. A Cronbach’s alpha of 0.7 or higher was considered acceptable, a value above 0.8 was considered good, and a value above 0.9 was considered excellent22.
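
A minimal sketch of the Cronbach’s alpha calculation and the interpretation bands described above follows. It uses the standard formula based on item variances and the variance of the total score; the function names and the example responses are hypothetical.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a table with one row per respondent, one column per item."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item and of the summed score.
    item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]
    total_var = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def interpret_alpha(alpha: float) -> str:
    """Map alpha to the bands used in the text (>=0.7, >0.8, >0.9)."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    return "below the acceptable threshold"


# Hypothetical responses of four patients to three items of one domain.
data = [[3, 4, 3], [2, 2, 3], [4, 5, 4], [1, 2, 2]]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.3f} ({interpret_alpha(alpha)})")
```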

The item-level content validity index was calculated as the number of experts who gave a score of 7–10 divided by the total number of experts participating in the evaluation. The experts were asked to rate the importance of each item as an outcome following therapy for LSS patients, taking into account the item’s comprehensiveness, relevance, and comprehensibility. The content validity index at the scale and domain level was calculated as the average of all items within the scale or domain. A content validity index of 0.7 or higher was considered to indicate high-quality content validity24.
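
The content validity index described above can be sketched as follows. The function names and the expert ratings are hypothetical; only the 7–10 scoring window, the averaging step, and the 0.7 threshold come from the text.

```python
from typing import Sequence


def item_cvi(expert_scores: Sequence[int]) -> float:
    """Share of experts who rated the item 7-10 on the importance scale."""
    if not expert_scores:
        raise ValueError("at least one expert rating is required")
    endorsing = sum(1 for s in expert_scores if 7 <= s <= 10)
    return endorsing / len(expert_scores)


def scale_cvi(ratings_by_item: Sequence[Sequence[int]]) -> float:
    """Average of the item-level indexes across a scale or domain."""
    item_values = [item_cvi(scores) for scores in ratings_by_item]
    return sum(item_values) / len(item_values)


# Hypothetical ratings from four experts for two items.
ratings = [
    [8, 9, 6, 7],   # item A: three of four experts score 7-10
    [3, 7, 10, 5],  # item B: two of four experts score 7-10
]
for name, scores in zip(("item A", "item B"), ratings):
    value = item_cvi(scores)
    print(f"{name}: CVI = {value:.2f}, high quality: {value >= 0.7}")
print(f"scale-level CVI = {scale_cvi(ratings):.3f}")
```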

For construct validity, it is critical to establish evidence of the construct’s validity through correlations with external criteria25. In this study, the Pearson correlation coefficients of the pretreatment and posttreatment assessments for SS and PF were used to evaluate the degree of correlation with the external criteria (ODI and SF-12v2). Scales assessing related concepts were expected to show a moderate to strong correlation, whereas scales measuring dissimilar concepts were expected to show a weak correlation. We hypothesized that the SF-12v2 would have a weak to moderate correlation with the ZCQ and that the ODI would have a strong correlation with the ZCQ.

Correlation coefficients were classified into five levels: very strong (r = 0.80 to 1.00), strong (r = 0.60 to 0.79), moderate (r = 0.40 to 0.59), weak (r = 0.20 to 0.39), and very weak (r = 0.00 to 0.19).
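
The correlation step and the five-level classification can be sketched as below. The classification is applied to the absolute value of r so that negative correlations (for example, between the ZCQ and the SF-12v2) can be graded on the same bands; that convention, the function names, and the example scores are assumptions.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def classify_correlation(r: float) -> str:
    """Grade |r| using the five bands listed in the text."""
    size = abs(r)
    if size >= 0.80:
        return "very strong"
    if size >= 0.60:
        return "strong"
    if size >= 0.40:
        return "moderate"
    if size >= 0.20:
        return "weak"
    return "very weak"


# Hypothetical paired scores: ZCQ SS domain and ODI for six patients.
zcq_ss = [2.1, 2.9, 3.4, 3.0, 4.1, 2.6]
odi = [22.0, 40.0, 48.0, 35.0, 60.0, 38.0]
r = pearson_r(zcq_ss, odi)
print(f"r = {r:.3f} ({classify_correlation(r)})")
```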

For structural validity, confirmatory factor analysis (CFA) was used to test the underlying structure. CFA models are a preferred standard for testing many aspects of scale construction because they allow model hypotheses to be tested26. A model was considered well fitting when the Comparative Fit Index (CFI) and Goodness of Fit Index (GFI) were greater than 0.9, the Root Mean Square Error of Approximation (RMSEA) was less than 0.08, the Root Mean Square Residual (RMR) was less than 0.05, and the Normed Fit Index (NFI) and Non-normed Fit Index (NNFI) were each greater than 0.9, following published criteria27. When comparing models with an equal number of degrees of freedom (df), a lower χ2 value indicates a better fit.

For discriminant validity, patients were grouped by their disease severity at baseline, as measured by the ODI. According to the ODI score, the degree of disability was divided into mild (< 56 points) and moderate to severe (> 56 points)28. ZCQ domain scores were compared between groups using the independent-samples t-test.

Responsiveness was assessed using the Effect Size (ES), Standardized Response Mean (SRM) and Change Rate (CR). ES and SRM values of approximately 0.2, 0.5, and 0.8 were taken to indicate low, medium, and high magnitudes of change over time29. CR can generally take 5%, 10%, 15%, or 20%30. The higher the ES or SRM, the greater the sensitivity to detect change.
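
The source does not give formulas for the ES and SRM. The sketch below assumes the conventional definitions: ES is the mean change divided by the standard deviation of the baseline scores, and SRM is the mean change divided by the standard deviation of the change scores. The function names and example data are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def change_scores(baseline: Sequence[float], follow_up: Sequence[float]) -> list:
    """Per-patient change, follow-up minus baseline."""
    if len(baseline) != len(follow_up) or len(baseline) < 2:
        raise ValueError("need paired samples of equal length >= 2")
    return [after - before for before, after in zip(baseline, follow_up)]


def effect_size(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """ES (assumed definition): mean change / SD of baseline scores."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)


def standardized_response_mean(
    baseline: Sequence[float], follow_up: Sequence[float]
) -> float:
    """SRM (assumed definition): mean change / SD of the change scores."""
    changes = change_scores(baseline, follow_up)
    return mean(changes) / stdev(changes)


# Hypothetical domain scores for five patients before and after treatment.
before = [3, 4, 5, 2, 4]
after = [2, 4, 4, 2, 3]
es = effect_size(before, after)
srm = standardized_response_mean(before, after)
# Scores fall with improvement, so both values are negative; the magnitude
# is what is compared with the 0.2 / 0.5 / 0.8 benchmarks.
print(f"ES = {es:.3f}, SRM = {srm:.3f}")
```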

For interpretability, commonly used methods for estimating the MCID include the criterion method, the distribution method, the expert opinion method, and the literature analysis method31,32,33. Because of the high correlation between the SF-12v2 and the ZCQ, we used the criterion method to estimate the MCID.

Floor and ceiling effects can affect the evaluation of reliability and validity. If more than 15% of respondents obtain the minimum or maximum score, a floor or ceiling effect is considered to be present22. Floor and ceiling effects were analyzed using ZCQ scores by calculating the frequency of the lowest and highest possible scores.
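
The 15% rule can be sketched as follows for a single item or domain. The function name, the returned dictionary keys, and the example responses are illustrative assumptions.

```python
from typing import Sequence


def floor_ceiling(
    scores: Sequence[float],
    min_score: float,
    max_score: float,
    threshold: float = 0.15,
) -> dict:
    """Share of respondents at the lowest and highest possible score.

    An effect is flagged when the share exceeds `threshold` (15% in the text).
    """
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical responses of ten patients to one SS item scored 1 to 5.
item_scores = [1, 1, 2, 3, 5, 4, 3, 2, 1, 3]
result = floor_ceiling(item_scores, min_score=1, max_score=5)
print(result)
```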

All statistical tests were two-sided with a significance level of 5%. All analyses were performed by using SPSS 26 and Scientific Platform Serving for Statistics Professional, SPSSPRO (spsspro.com).

Results

Of the 105 patients enrolled at baseline, 75% completed the questionnaire sets after 1.5-3 months of treatment, with no missing data. The demographics and scale scores are shown in Table 1.

Table 1 Baseline and follow-up information of patients with lumbar spinal stenosis.

Reliability

Internal consistency was good. The total Cronbach’s alpha coefficient of the ZCQ baseline data was 0.874, and the Cronbach’s alpha coefficient after deleting a single item ranged from 0.855 to 0.884 (details in Appendix 1). The Cronbach’s alpha coefficient was 0.793 for SS and 0.870 for PF.

To analyze test-retest reliability, the study included 59 participants who had undergone treatment and showed no change in their response to SF-12v2 item 1 between baseline and the 1.5-3 month follow-up. Because patients received treatment within the measurement interval, reliability was not assessed under strictly stable conditions, and item 1 may not capture every change in a patient’s condition. However, restricting the analysis to patients whose item 1 response was unchanged from baseline provides a group whose overall health status can reasonably be treated as stable for the retest evaluation. The ICCs for the SS and PF domains of the ZCQ were 0.836 (95% CI 0.724–0.903) and 0.741 (95% CI 0.583–0.841), respectively (time interval, mean ± SD = 63.23 ± 22.23 days).

Validity

Content validity was assessed by 30 experts who scored the importance of the items in the Ch-ZCQ. The content validity index of the scale in this study was 0.764, and the item-level content validity indexes ranged from 0.500 to 0.933. The content validity of the scale was considered good, except for items 10, 11, and 12 (details in Appendix 2).

The structural validity of the ZCQ is presented in Table 2. To avoid the effects of treatment or time-related changes, we assessed structural validity using only baseline data. The model fitted to the baseline data showed a good fit.

Table 2 Structural validity of baseline data of the ZCQ.

To evaluate construct validity, the correlation coefficients between the ZCQ and the ODI and SF-12v2 domains were calculated (see Table 3). The SS and PF domains of the ZCQ were strongly correlated with each other. The ZCQ and the ODI showed a strong positive correlation. Additionally, the SF-12v2 showed moderate negative correlations with the ZCQ.

Table 3 Spearman’s correlation coefficients (95% confidence interval) between the ZCQ and the ODI and SF-12v2.

Discriminative validity

The results of the independent-samples t-test are presented in Table 4. ZCQ scores differed significantly between the severity groups, indicating that the ZCQ has good discriminative validity.

Table 4 Discriminative validity of baseline and follow-up data of ZCQ.

Responsiveness

To assess responsiveness, the ESs of the ZCQ and of the external criteria (ODI and SF-12v2) were calculated. The ES values for ZCQ SS, ZCQ PF and the ODI were 0.46, 0.35, and 0.21, respectively, i.e., in the low-to-moderate range. The SRMs were low, with values of 0.34, 0.25, and 0.12, respectively. Furthermore, the average CRs were 10%, 10% and 5%. The assessment failed to address the patient’s subjective reports of overall improvement. The differences in the SF-12v2 PCS and MCS were not statistically significant (P > 0.05).

Interpretability

The MCID was calculated using the criterion method. Changes in 79 patients before and after treatment were analyzed using SF-12v2 item 1 as an anchor. There were 16 patients in the improved group, 4 patients in the deteriorated group, and 59 patients in the unchanged group. The mean value was used to estimate the MCID. The MCID was −0.21 (95% CI −0.36 to −0.05) for SS and −0.16 (95% CI −0.36 to −0.03) for PF.
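
The anchor-based grouping can be sketched as below. The source states only that the mean value was used; this sketch assumes that the MCID is taken as the mean change in domain score among patients whose SF-12v2 item 1 rating improved. That assumption, the record layout, the function names, and the example data are all hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple

# Item 1 response options ordered from worst to best.
RATING_ORDER = ["poor", "moderate", "good", "very good", "excellent"]

Record = Tuple[str, str, float, float]  # anchor before, anchor after, score before, score after


def anchor_group(before: str, after: str) -> str:
    """Classify a patient by the direction of change on the anchor item."""
    delta = RATING_ORDER.index(after) - RATING_ORDER.index(before)
    if delta > 0:
        return "improved"
    if delta < 0:
        return "deteriorated"
    return "unchanged"


def mcid_mean_change(records: Sequence[Record]) -> float:
    """Mean score change (follow-up minus baseline) in the improved group."""
    changes = [
        score_after - score_before
        for rating_before, rating_after, score_before, score_after in records
        if anchor_group(rating_before, rating_after) == "improved"
    ]
    if not changes:
        raise ValueError("no patient improved on the anchor item")
    return mean(changes)


# Hypothetical patients with SS domain scores before and after treatment.
patients = [
    ("moderate", "good", 3.0, 2.5),
    ("poor", "moderate", 3.4, 3.1),
    ("good", "good", 2.0, 2.1),
    ("good", "moderate", 2.2, 2.6),
]
print(f"anchor-based MCID estimate = {mcid_mean_change(patients):.2f}")
```

A negative value is expected because lower ZCQ scores indicate less severe LSS.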

Floor and ceiling effects

Ceiling and floor effects were observed in both domains of the ZCQ. Six items showed a ceiling or floor effect in the baseline data, and 8 items did so in the follow-up data (see Appendix 3).

Discussion

The Ch-ZCQ was translated and linguistically validated prior to this study6. In this study, the psychometric properties of the ZCQ were assessed using data collected from Chinese LSS patients. The Ch-ZCQ demonstrated good validity and reliability, and its internal consistency was good. Construct validity assesses a scale’s accuracy in measuring a theoretical construct by comparing it with other scales, while structural validity evaluates the reasonableness of a scale’s internal structure using CFA. Our study demonstrated that the ZCQ correlated well with the SF-12v2 and the ODI, and that the CFA model fit was good. The SS domain showed a moderate level of responsiveness, while the PF domain exhibited poor responsiveness. These results indicate that the ZCQ is a meaningful instrument for evaluating the therapeutic effect in patients with degenerative lumbar spinal stenosis. The MCIDs of the ZCQ SS and PF domains were −0.21 and −0.16, respectively. The MCID is a patient-derived score that reflects the smallest change that is meaningful to patients following a clinical intervention. Understanding the MCID of the ZCQ score will help clinicians interpret treatment effects.

In our study, the total Cronbach’s α of the ZCQ was 0.874 (range after deleting a single item, 0.855–0.884), and the Cronbach’s α coefficients of SS and PF were 0.793 and 0.870, respectively. Previous studies support our findings, reporting Cronbach’s α coefficients of 0.787, 0.896, and 0.948 for SS and of 0.847, 0.866, and 0.968 for PF.

Test-retest reliability reflects the consistency of the scale and is primarily assessed using the intra-class correlation coefficient. In this study, test-retest reliability was good, with ICCs of 0.836 and 0.741 for SS and PF, respectively. Previous studies have reported ICCs for SS and PF of 0.81 and 0.89 (time interval of 3 months)7, 0.93 and 0.91 (3–5 days)6, and 0.89 and 0.92 (one week)8.

In terms of construct validity, the study showed strong correlations of SS and PF with the ODI (0.646 ~ 0.817) and moderate correlations with the PCS (−0.527 ~ −0.416) and MCS (−0.506 ~ −0.353) of the SF-12v2. Nobuhiro Hara et al.7 found that the SS and PF domains were strongly correlated with the ODI (r = 0.63 ~ 0.75) and with SF-36 physical function (r = −0.65 ~ −0.28). Honglei Yi et al.6 reported a strong correlation between the ZCQ and the SF-36 (r = −0.685 ~ −0.700).

In terms of discriminative validity, this study used the ODI to classify the severity of the condition. The results indicated that the ZCQ can discriminate patients with mild disability from those with moderate to severe disability. This is beneficial for doctors because it allows them to quickly assess the severity of a patient’s condition in the clinic.

In terms of responsiveness, the study showed that the ZCQ could respond to changes in patients who underwent non-surgical treatment. Previous studies have demonstrated good responsiveness, with ES values for SS and PF of 1.737, 2.63 and 2.359, and SRM values of 1.54 and 1.387,9. The lower responsiveness in this study may be attributed to the different treatment methods. The baseline data from this study were similar to those of the two studies that used surgical treatment8, but this study employed non-surgical treatment. The low responsiveness may also be due to the different follow-up times. In the referenced studies, the follow-up time ranged from 6 months to 1 year, whereas the follow-up period in this study was only 1.5-3 months. The low ES and SRM values may also be attributed to the floor/ceiling effects in the items.

In terms of MCID, the study showed that the MCID values for SS and PF were 0.21 and 0.16. Cleland Joshua et al.34 suggested that clinicians should consider using MCIDs of 0.36 for SS and 0.10 for PF. These values are of a similar magnitude to the findings of this paper.

The content and quality of the items have a direct impact on the content validity of the scale. Since the Ch-ZCQ is a translation, the content of the items is identical to that of the original. Consequently, the experts were asked to rate the importance of each item as an outcome after therapy, considering the item’s comprehensiveness, relevance, and comprehensibility, particularly for the conditions of the Chinese LSS population. The study showed that the importance scores of items 10, 11, and 12 were below 7.0. Experts noted that most of these items concentrated on walking distance and contained repetitious information that was inappropriate for Chinese patients. The low content validity index may also be explained by the fact that items 10, 11, and 12 displayed ceiling and floor effects.

Structural validity reflects the correspondence between the theoretical structure of the scale and the measured values. Although half of the items in this study exhibited floor and ceiling effects, the confirmatory factor analysis (CFA) model fit was good. Although CFA is considered the most efficient method for assessing structural validity15, few studies on this topic have been published. The exploratory factor analysis of the Thai version of the ZCQ, which identified four factors, partly supports our findings. The first and second factors were related to the original ZCQ’s PF and patient satisfaction domains, respectively, while the third and fourth factors concerned pain symptoms and neurological disability, which relate to the original ZCQ’s SS domain36. Therefore, our evidence of structural validity and content validity, both of which are crucial for questionnaire development and psychometric assessment, particularly in cross-cultural adaptation, suggests that the Ch-ZCQ may be an effective instrument for assessing the outcomes of LSS patients following non-surgical treatment.

Strength and weaknesses

Previous studies of the ZCQ’s measurement properties in China6, Japan7, Iran8, and Korea9 focused on surgically treated patients and lacked MCID evaluation, whereas our study has the following advantages. The ZCQ’s psychometric properties were thoroughly evaluated in LSS patients receiving non-surgical therapies, offering important new information on the validity and reliability of the instrument in the Chinese patient population. Additionally, this study fills a major gap in the literature by being the first to assess content validity, construct validity, responsiveness, and MCID in the Chinese LSS population. It also provides vital information for treatment evaluation and clinical decision-making. Furthermore, the use of the COSMIN checklist and criteria to examine the ZCQ’s properties supports scientific rigor and brings the study in line with global best practices for the assessment of outcome measurement instruments. Overall, these strengths enhance the credibility and applicability of the study’s findings, contributing meaningfully to the field of LSS research and patient care.

There are also some limitations. Firstly, this study was conducted in Beijing, which may result in an over-representation of the urban population. Therefore, populations from other areas of China, including suburban areas, need to be considered. Secondly, we only assessed known-group validity based on the ODI score. The ODI is a recognized scale for evaluating low back and leg symptoms that encompasses various domains, and it is both feasible and sufficient as a criterion for discriminative validity35. Thirdly, the test-retest reliability assessment was restricted to patients whose overall health rating did not change during therapy, i.e., those whose SF-12v2 item 1 response remained unchanged. This anchor might not capture every change in a patient’s condition, but an unchanged rating compared with baseline provides a reasonable basis for treating these patients as stable in the test-retest evaluation.