The Patient Health Questionnaire-9 among First-Trimester Pregnant Women in Japan: Factor Structure and Measurement and Structural Invariance between Nulliparas and Multiparas and across Perinatal Measurement Time Points
1. Introduction
Depression is a mental disorder commonly seen during pregnancy. The incidence of a DSM-IV Major Depressive Episode (MDE) during pregnancy is about 5% (Kitamura et al., 2006). If less severe forms of depression, such as Minor Depression and Intermittent Depression as defined by the Research Diagnostic Criteria (RDC: Spitzer et al., 1978), are included, the incidence may exceed 15% (Kitamura, Shima, Sugawara, & Toda, 1993, 1996). The onset of antenatal depression is often characterised by psychosocial correlates such as a lack of support from a partner, poor housing conditions, adverse rearing experiences in childhood, and the woman’s unstable personality (e.g., Kitamura, Shima, Sugawara, & Toda, 1993; Kitamura, Toda, Shima, & Sugawara, 1994b; Kitamura, Toda, Shima, Sugawara, & Sugawara, 1998). Therefore, antenatal depression is an important health issue that should be recognised by nurses and midwives.
Quite a few instruments have been used as screening devices to identify antenatal depression. They include the General Health Questionnaire (GHQ: Goldberg, 1972) (e.g., Kitamura, Sugawara, Aoki, & Shima, 1989; Kitamura, Toda, Shima, & Sugawara, 1994a), the Self-Rating Depression Scale (SDS: Zung, 1965) (e.g., Kitamura, Shima, Sugawara, & Toda, 1994c; Kitamura, Sugawara, Shima, & Toda, 1999), the Beck Depression Inventory (BDI: Beck et al., 1961) (e.g., Salamero et al., 1994), and the Edinburgh Postnatal Depression Scale (EPDS: Cox et al., 1987). Although the EPDS was originally developed as a screening tool to identify postnatal depression, it is often used to screen for antenatal depression (Matthey et al., 2016; Wickberg et al., 2005). A drawback of these screening instruments is their relative inability to accurately identify cases that meet diagnostic criteria such as those for MDE. While their negative predictive value is usually high, their positive predictive value is poor. The Patient Health Questionnaire-9 (PHQ-9: Spitzer et al., 1999) is a self-report measure that can easily be mapped onto the diagnostic criteria of MDE because all of its items are derived from the MDE criteria. The PHQ-9 is thus a diagnostic criteria-based screening instrument. The psychometric properties of the PHQ-9 as a screening tool for MDE have been widely reported (Beard et al., 2016; Gilbody et al., 2007; Inoue et al., 2012; Manea et al., 2012; Wittkampf et al., 2007). Another strength of the PHQ-9 is its ability to reflect depression severity. Among psychiatric inpatients (Kitamura, Nakagawa, & Machizawa, 1993), the number of MDE items identified as present was moderately correlated with depression severity as rated on interviewer-rated scales such as Hamilton’s Rating Scale for Depression (Hamilton, 1960) and the Global Assessment Scale (GAS: Endicott et al., 1976).
PHQ-9 scores were also associated with severity ratings on other measures, including the Center for Epidemiologic Studies Depression Scale (Radloff, 1977; Beard et al., 2016), and with functional difficulty (Kroenke et al., 2001).
When using a psychological scale as either a screening instrument or a severity measure, it is of great importance to examine its factor structure. For example, if the scale consists of more than one factor, it is recommended to rate severity using subscales rather than a total score. If the factor structure differs between groups, the subscale scores will differ, and so will the care or interventions offered to the participants.
The factor structure of the nine items of the PHQ-9 has been studied by several researchers. Some reports support a single-factor model (González-Blanch et al., 2018). In patients with spinal cord injury and depression, a comparison of different factor structure models indicated that a two-factor structure was the best (Krause et al., 2011). In this model, one factor included the items “sleep change”, “fatigue”, “appetite change”, “psychomotor agitation/retardation”, and “concentration difficulties”. In other studies, one factor included the somatic items “sleep change”, “fatigue”, and “appetite change”, and the other factor included the non-somatic items (Hall et al., 2021; Keum et al., 2018). Using both patients with psychiatric issues and non-clinical populations, Doi et al. (2018) asserted that the two-factor bifactor model best fit the data. In that study, three items (“sleep change”, “fatigue”, and “appetite change”) had high factor loadings on the first factor. However, no consensus has been reached about the item configuration. Recently, the bifactor model has attracted the attention of researchers and clinicians. The bifactor model has a general factor and several group factors (subscales). The research and clinical implications of a bifactor model depend on whether the model is essentially unidimensional (so that the effects of the group factors are negligible) or multidimensional (so that the effects of the general factor are negligible). The explained common variance (ECV) and a group of omega (ω) coefficients are useful indicators here (Reise, 2013; Rodriguez et al., 2016). The ECV is the proportion of common variance across all items that is explained by the general dimension alone. Stucky and Edelen (2014) suggested that ECV values of approximately 0.85 or higher are needed to consider a set of items sufficiently unidimensional to warrant a one-factor model. The higher the omega hierarchical (ωH), the more strongly unidimensionality is suggested (Rodriguez et al., 2016).
Because of the worldwide use of the PHQ-9, its configural, measurement, and structural invariance has attracted researchers’ attention (van de Vijver & Leung, 2000). The validity of an instrument is in doubt if its factor structure (configural invariance), factor loadings of indicators/items (metric invariance), indicators’ intercepts (scalar invariance), and residuals of the indicators (residual invariance) differ significantly between people from different countries (and cultures and languages). Measurement invariance includes metric, scalar, and residual invariance. Moreover, the factor variances, factor covariances, and factor means of a psychological measure should be equivalent (structural invariance) between groups with different demographic characteristics if the measure is used as a means of comparison (Vandenberg & Lance, 2000). González-Blanch et al. (2018) used a primary care patient population and found that a two-factor model fit the data better than a single-factor model. However, because of a strong correlation between the two factors in the two-factor model, they preferred the single-factor model. They also confirmed measurement invariance of the single-factor model across sex, age group, marital status, level of education, and employment status. Invariance of the factor structure should also be confirmed between observation times (Widaman et al., 2010). Comparing English-speaking and Spanish-speaking women in the U.S., Merz et al. (2011) reported that a single-factor PHQ-9 structure demonstrated configural and factor variance equivalence. However, that study failed to confirm residual invariance, and factor mean invariance was not reported. Doi et al. (2018) studied patients with MDE and reported that a two-factor bifactor model fit better than other models and that this model satisfied scalar invariance.
The use of the PHQ-9 in populations of pregnant women has been reported in a few studies. Woldetensay et al. (2018) reported the validity of the PHQ-9 as a screening tool for antenatal depression, validated against a diagnosis of MDE derived from an interview by a mental health specialist. In a population of pregnant Peruvian women (Zhong et al., 2014), exploratory factor analysis (EFA) indicated a two-factor structure that was supported by confirmatory factor analysis (CFA). In a population of pregnant Spanish women (Marcos-Nájera et al., 2018), however, a three-factor model fit the data better than a two-factor model. Invariance of the PHQ-9 factor structure has not been reported in this population. Even when the scale is used in pregnant women, the invariance of its factor structure (configural invariance), the factor loadings of the indicators and items (metric invariance), the intercepts of the indicators (scalar invariance), and the residuals of the indicators (residual invariance) must be demonstrated. Invariance in terms of parity is important because nulliparas and multiparas have differed in many studies using structural equation modelling (e.g., Kitamura, Ohashi, Murakami, & Goto, 2019; Kitamura, Ohashi, Sakanashi, & Tanaka, 2019).
To the best of our knowledge, the PHQ-9 has never been used in a population of pregnant Japanese women. Given the high incidence of antenatal depression, it may be of clinical and research importance to study the psychometric properties of the PHQ-9 in a population of pregnant women, including EFA and CFA as well as the invariance of the identified factor structure.
2. Methods
2.1. Study Procedures and Participants
The target population of this study was pregnant women at 10 to 13 weeks’ gestational age. Approximately 1500 pregnant women were invited to participate at the antenatal clinics of one general hospital and five private clinics located in Tokyo, Chiba, Ibaraki, and Kagoshima Prefectures in Japan. Women were excluded if they: 1) were not fluent in Japanese; 2) were aged under 20; 3) had eating disorders; 4) had symptoms of vaginal bleeding or abdominal pain; 5) had subchorionic haematoma; or 6) had experienced recurrent miscarriages. We administered a set of questionnaires on two occasions, 1 week apart. A total of 382 pregnant women (approximately 25% of those solicited) participated in this study. Of these, 129 women responded to the retest 1 week later. Participation was voluntary and anonymity was assured. Written informed consent was obtained from each participant. The mean age of the participants was 31.9 (SD 4.9) years and that of their partners was 33.5 (SD 5.5) years (Table 1). Most participants were married (94.5%); 44.0% were nulliparas and 55.0% were multiparas. Recruitment was conducted from January 2017 to May 2019.
2.2. Measurements
PHQ-9 (Spitzer et al., 1999): We used the Japanese version of the PHQ-9 (Inagaki et al., 2013; Muramatsu & Kamijima, 2009). This is a nine-item self-report depression scale based on the MDE criteria in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV). Each item assesses the frequency of a depressive symptom over the previous two weeks on a 4-point Likert scale from 0 to 3.
Sheehan Disability Scale (SDS: Sheehan, 1983): We used the Japanese version of the SDS (Yoshida et al., 2004). The SDS is a three-item self-report scale that measures disability in three domains: 1) work and school work; 2) social and leisure activities; and 3) family life and home responsibilities. Each item is rated from 0 to 10. Its psychometric validation has been reported previously (Arbuckle et al., 2009). The psychometric properties of the SDS in the present sample were reported elsewhere (Hada et al., 2021).
Table 1. Demographic features of the participants.
2.3. Data analysis
The whole sample was randomly divided into two groups: Group A (n = 184) for EFA and Group B (n = 198) for CFA. In Group A, we calculated the mean, SD, skewness, and kurtosis of each PHQ-9 item. As shown in the Results, some items were excessively skewed (skewness > 2.0), so the PHQ-9 items were log transformed. These log-transformed PHQ-9 items were subjected to EFA. Factorability of the items was checked by the Kaiser-Meyer-Olkin (KMO) index and Bartlett’s sphericity test (Burton & Mazerolle, 2011). We then performed a series of exploratory factor analyses (EFAs) using the maximum-likelihood method with PROMAX rotation, starting from a single-factor model and progressing to models with a greater number of factors (i.e., two- and three-factor structures, and so on). These models were compared in terms of goodness-of-fit in a series of confirmatory factor analyses (CFAs) in Group B. The fit of the models was examined in terms of chi-squared (χ2), the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). A good fit would be indicated by χ2/df < 2, CFI > 0.97, and RMSEA < 0.05, and an acceptable fit by χ2/df < 3, CFI > 0.95, and RMSEA < 0.08 (Bentler, 1990; Schermelleh-Engel et al., 2003). A model was judged better than another if its Akaike information criterion (AIC; Akaike, 1987) was lower.
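The fit criteria above can be expressed as a small decision rule. The following Python sketch is illustrative only: the function name `classify_fit` and the input values are hypothetical, and it assumes that all three criteria must hold at once for a model to count as good or acceptable, which the text does not state explicitly.

```python
def classify_fit(chi2, df, cfi, rmsea):
    """Classify model fit using the cut-offs quoted in the text.

    Assumption: all three criteria must be met jointly.
    """
    ratio = chi2 / df
    if ratio < 2 and cfi > 0.97 and rmsea < 0.05:
        return "good"
    if ratio < 3 and cfi > 0.95 and rmsea < 0.08:
        return "acceptable"
    return "poor"


# Hypothetical fit statistics, not taken from the study.
print(classify_fit(chi2=18.5, df=18, cfi=0.999, rmsea=0.012))
print(classify_fit(chi2=50.0, df=20, cfi=0.960, rmsea=0.070))
print(classify_fit(chi2=100.0, df=20, cfi=0.900, rmsea=0.100))
```

If the criteria were instead read as alternatives, the `and` operators would be replaced by `or`; the joint reading is the stricter of the two.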
Comparison of the factor structure models derived from the EFAs, a model from the literature, and the bifactor model was performed as cross-validation (Cliff, 1983; Cudeck & Browne, 1983; Romera et al., 2008) using the second half of the sample, Group B. Starting with the single-factor model, each subsequent model was judged as “accepted” if χ2 decreased significantly for the difference in df. This was repeated until we reached the best model. After deciding on the best-fit model among the first-order models, we built a bifactor model. We then calculated the ECV and ω indices of this model to determine whether it was essentially unidimensional or multidimensional. The ECV is the proportion of common variance across all items that is explained by the general dimension alone. Stucky and Edelen (2014) suggested that ECV values of approximately 0.85 or higher are needed to consider a set of items sufficiently unidimensional to warrant a one-factor model. Omega (ω) indicates the proportion of variance of the whole measurement explained by the general factor and all the group factors together. Omega subscale (ωS) indicates the proportion of the variance among the items of each specific group factor explained by both the general factor and that group factor. The proportion of the variance of the whole measurement explained by the general factor alone is termed omega hierarchical (ωH). Omega hierarchical subscale (ωHS) indicates the proportion of the variance among the items of each specific group factor explained by that group factor alone. The higher the ωH, the more strongly unidimensionality is suggested (Rodriguez et al., 2016).
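As an illustration of how these indices may be obtained from a fitted bifactor solution, the sketch below follows the formulas commonly given for standardised loadings with orthogonal factors (Rodriguez et al., 2016). The function name and all loadings are hypothetical and are not the estimates of this study; only the assignment of items 3, 4, and 5 to the Somatic factor follows the present model.

```python
def bifactor_indices(general, groups):
    """Compute ECV and omega coefficients from standardised bifactor loadings.

    general: general-factor loadings, one per item.
    groups:  {factor name: {item index: group-factor loading}}.
    Assumes orthogonal factors, so uniqueness = 1 - sum of squared loadings.
    """
    n = len(general)
    group_loading = [0.0] * n
    for items in groups.values():
        for i, lam in items.items():
            group_loading[i] = lam
    uniq = [1 - general[i] ** 2 - group_loading[i] ** 2 for i in range(n)]

    gen_ss = sum(l ** 2 for l in general)
    grp_ss = sum(l ** 2 for l in group_loading)
    gen_part = sum(general) ** 2
    grp_part = sum(sum(items.values()) ** 2 for items in groups.values())
    total = gen_part + grp_part + sum(uniq)

    out = {
        "ECV": gen_ss / (gen_ss + grp_ss),
        "omega": (gen_part + grp_part) / total,
        "omega_H": gen_part / total,
    }
    for name, items in groups.items():
        g = sum(general[i] for i in items) ** 2
        s = sum(items.values()) ** 2
        denom = g + s + sum(uniq[i] for i in items)
        out["omega_S_" + name] = (g + s) / denom
        out["omega_HS_" + name] = s / denom
    return out


# Hypothetical loadings for nine items (0-based indices: items 3-5 are 2-4).
general = [0.6, 0.7, 0.5, 0.6, 0.5, 0.7, 0.5, 0.4, 0.4]
groups = {
    "somatic": {2: 0.4, 3: 0.5, 4: 0.4},
    "non_somatic": {0: 0.2, 1: 0.1, 5: 0.1, 6: 0.2, 7: 0.1, 8: 0.1},
}
indices = bifactor_indices(general, groups)
for key, value in indices.items():
    print(key, round(value, 3))
```

The resulting ECV can then be compared with the threshold of approximately 0.85, and ωH with ω, to judge how much of the reliable variance is attributable to the general factor.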
The configural, measurement, and structural invariance of the bifactor model was examined across different attributes (parity and observation occasions) in the whole sample. The tests proceeded from configural invariance through metric, scalar, residual, and factor variance invariance to factor covariance invariance. The progress from one step to the next was judged as “accepted” if 1) the χ2 change was not significant for the df difference; 2) the decrease of CFI was less than 0.01; or 3) the increase of RMSEA was less than 0.01 (Chen, 2007; Cheung & Rensvold, 2002). This procedure was applied because the χ2 difference test is strongly sensitive to sample size (N) and, particularly in the case of a large sample, produces an unreasonable rejection of invariance.
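The stepwise rule above can be sketched as follows. This is a minimal illustration using only the Python standard library: the function names, the significance level of 0.05, and the input values are assumptions made here, and the three criteria are treated as alternatives, following the “or” in the text.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df.

    Uses the closed forms for df = 1 and df = 2 and the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1), with z = x / 2.
    """
    half = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-half), 1.0
    else:
        q, a = math.erfc(math.sqrt(half)), 0.5
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def step_accepted(delta_chi2, delta_df, cfi_drop, rmsea_rise, alpha=0.05):
    """Return True if any of the three criteria supports invariance.

    cfi_drop and rmsea_rise are the worsening of CFI and RMSEA relative to
    the previous, less constrained model. alpha = 0.05 is an assumption.
    """
    chi2_ok = chi2_sf(delta_chi2, delta_df) >= alpha
    return chi2_ok or cfi_drop < 0.01 or rmsea_rise < 0.01


# Hypothetical step: significant chi-squared change but a small CFI drop.
print(step_accepted(delta_chi2=30.0, delta_df=5, cfi_drop=0.005, rmsea_rise=0.02))
# Hypothetical step failing all three criteria.
print(step_accepted(delta_chi2=30.0, delta_df=5, cfi_drop=0.02, rmsea_rise=0.02))
```

Treating the criteria as alternatives makes the rule lenient in large samples, which is the stated motivation for supplementing the χ2 test with the CFI and RMSEA changes.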
2.4. Ethical Consideration
This study was approved by the Institutional Review Board (IRB) of the Kitamura Institute of Mental Health Tokyo (No. 2015052301) and Kagoshima University (No. 170247).
3. Results
The mean, SD, skewness, and kurtosis of each PHQ-9 item in Group A are shown in Table 2. Two items (item 8 “psychomotor symptoms” and item 9 “suicidality”) were excessively skewed. Hence, all of the PHQ-9 items were log transformed for the subsequent analyses.
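To illustrate the skewness check and the transformation, the sketch below computes moment-based skewness for a hypothetical, heavily right-skewed item and applies a log transformation. The data are invented, and both the skewness formula (statistical packages often report a bias-adjusted variant) and the offset of 1 used to handle zero scores are assumptions, since the text does not specify them.

```python
import math


def skewness(xs):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def log_transform(xs):
    """Natural log with an offset of 1 so that a score of 0 stays defined."""
    return [math.log(x + 1) for x in xs]


# Hypothetical responses to one item scored 0-3, mostly "not at all".
raw = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
logged = log_transform(raw)

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

For an item with this many zero responses the transformation reduces the skewness but does not necessarily bring it below 2.0.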
After confirming that the data were factorable (KMO = 0.816; Bartlett’s sphericity test χ2 (36) = 465.199, p < 0.001), we performed EFAs (Table 3). In the single-factor model, all the items showed factor loadings > 0.30 (Costello & Osborne, 2005). The two-factor model resembled those of previous studies in that the first factor loaded on the somatic items “Sleep change”, “Fatigue”, and “Appetite change”, together with “Concentration difficulty”. In the three-factor model, the third factor included only two items, “Psychomotor symptoms” and “Suicidality”. A factor structure containing a factor with only two indicators is considered weak. Hence, the three-factor model was rejected as a model of the PHQ-9.
Table 2. Mean, SD, and skewness values for each PHQ-9 item (n = 184).
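The degrees of freedom of Bartlett’s sphericity test reported for the factorability check can be verified against the number of items. In its usual form, with n participants, p items, and R the item correlation matrix (these symbols are introduced here for illustration, and the software used may apply a slightly different correction), the statistic and its degrees of freedom are:

```latex
\chi^2 = -\left[(n - 1) - \frac{2p + 5}{6}\right] \ln\lvert R\rvert,
\qquad
df = \frac{p(p - 1)}{2}
```

With the nine PHQ-9 items, p = 9 gives df = 9 × 8 / 2 = 36, which agrees with the reported χ2 (36).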
We then compared the models derived from the EFAs, the two-factor somatic-non-somatic model, and the bifactor model in CFAs among Group B women. The two-factor somatic-non-somatic model was superior to the single-factor model and fit the data well (Table 4). In addition to a significant decrease of χ2 (Δχ2 = 23.137, Δdf = 1, p < 0.001), both CFI (from 0.914 to 0.977) and RMSEA (from 0.104 to 0.055) were better in the two-factor somatic-non-somatic model. We further built a bifactor model (Figure 1). The goodness-of-fit improved: χ2/df = 1.030 (df = 18), CFI = 0.999, RMSEA = 0.012. We therefore judged the bifactor model to be the best.
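The reported significance of the χ2 difference can be verified by hand. For one degree of freedom the upper-tail probability of the χ2 distribution equals erfc(√(x/2)), so the check needs only the Python standard library. The function name is hypothetical; the value 23.137 is the Δχ2 reported above.

```python
import math


def chi2_p_df1(x):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_p_df1(23.137)
print(f"p = {p_value:.2e}")
print("significant at 0.001:", p_value < 0.001)
```

This identity holds only for df = 1; comparisons involving more degrees of freedom require the general χ2 distribution function.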
We then calculated the ECV and omega indices (Table 5). Since the ECV was 0.761, below the threshold of approximately 0.85, we considered it reasonable to treat the PHQ-9 items as multidimensional (Stucky & Edelen, 2014). Similarly, the ω coefficients suggested multidimensionality: ω = 0.868; ωS (Somatic factor) = 0.756; ωS (Non-somatic factor) = 0.817; ωH = 0.795; ωHS (Somatic factor) = 0.330; ωHS (Non-somatic factor) = 0.032.
We then tested whether the bifactor model was invariant between nulliparas (n = 168) and multiparas (n = 210) using the whole sample. Parity was unknown in five cases. The bifactor model showed configural, measurement, and structural invariance (Table 6). Factor means differed significantly between nulliparas and multiparas only for the Non-somatic factor (Somatic factor, difference (SE) = −0.024 (0.038), NS; Non-somatic factor, difference (SE) = −0.107 (0.036), p < 0.01; General factor, difference (SE) = −0.015 (0.046), NS).
Table 3. EFA of the PHQ-9 (n = 184).
Note: Factor loadings > 0.3 are in bold.
Table 4. Comparison of factor structure models by CFA (n = 198).
Note: Indices of the somatic-non-somatic model were compared with those of the 1-factor model; **p < 0.01; ***p < 0.001; CFI: Comparative fit index; RMSEA: Root mean square error of approximation; AIC: Akaike information criterion.
PHQ-9: Patient Health Questionnaire-9; CFI: Comparative fit index; RMSEA: Root mean square error of approximation; AIC: Akaike information criterion. Paths are standardised. Paths with significant (p < 0.001) coefficients are in bold. The names of error variables are omitted.
Figure 1. Confirmatory factor analysis of the PHQ-9 (n = 198).
Table 5. Omega indices for the bifactor model.
Note: ECV = 0.761.
Table 6. Configural, measurement, and structural invariance of the two-factor model between nulliparas (n = 168) and multiparas (n = 210).
NS: Not significant; CFI: Comparative fit index; RMSEA: Root mean square error of approximation; AIC: Akaike information criterion.
The bifactor structure model was also examined in terms of its invariance between the initial (n = 377) and follow-up (n = 126) occasions (Table 7). Here again, the bifactor model showed configural, measurement, and structural invariance. Factor means were not significantly different between time 1 and time 2 (Table 8).
In the CFA, the Somatic factor and the General factor were significantly correlated with the SDS factor (r = 0.24 and 0.71, respectively), whereas the Non-somatic factor was not. This suggests that the severity of depression was associated with disability (Figure 2).
4. Discussion
Our study demonstrated that the PHQ-9 among pregnant women had a bifactor structure. This is in line with the previous literature (Doi et al., 2018; Hall et al., 2021; Keum et al., 2018). A different pattern of a two-factor structure of the PHQ-9 was reported by Krause et al. (2011); however, their sample consisted of people with spinal cord injuries, who were therefore more likely to report symptoms derived from physical disease, such as difficulty in moving and concentrating, that may have merged with the somatic factor of the PHQ-9. González-Blanch et al. (2018) argued for a single-factor model. However, in their study, the CFI was substantially better for the somatic-non-somatic model (0.97) than for the single-factor model (0.91). Here, the same three PHQ-9 items (items 3, 4, and 5) were grouped in the same factor. Their argument rejecting the two-factor model was based on a strong correlation (r = 0.86) between the two factors. However, our results indicated that the bifactor model of the PHQ-9 was the best. The ECV of our data suggested that it was reasonable to consider the PHQ-9 items as multidimensional. The ωHS values were low; that of the Non-somatic factor was extremely low (0.032). Items 1, 2, 6, 7, 8, and 9 were explained little by the Non-somatic factor and mainly by the General factor. All items were explained more strongly by the General factor than by their respective group factor (subscale). This characterises the factor structure of the PHQ-9.
Table 7. Configural, measurement, and structural invariance of the two-factor model between two observation occasions.
NS: Not significant; CFI: Comparative fit index; RMSEA: Root mean square error of approximation; AIC: Akaike information criterion.
Table 8. Factor mean for the PHQ-9.
Note: *p < 0.05; **p < 0.01; ***p < 0.001; NS: Not significant; SE: Standard error.
SDS: Sheehan Disability Scale; PHQ-9: Patient Health Questionnaire-9; CFI: Comparative fit index; RMSEA: Root mean square error of approximation; AIC: Akaike information criterion. Paths are standardised. Paths with significant (p < 0.001) coefficients are in bold. The names of error variables are omitted.
Figure 2. Confirmatory factor analysis of the PHQ-9 and SDS (N = 382).
Our results also supported invariance of the model in terms of parity as well as observation time. Invariance of the factor structure was present at the configural, measurement, and structural levels. At the factor mean level, there was no difference between nulliparas and multiparas in the General factor. Regarding the general severity of depression, this implies that the PHQ-9 measures the same phenomena during pregnancy regardless of parity and observation time. Disability was significantly correlated with the General factor and the Somatic factor of the PHQ-9, but not with the Non-somatic factor. This implies that disability in pregnancy is affected by somatic symptoms (i.e., sleep change, fatigue, and appetite change) and by the general severity of depression.
The robust factor structure regardless of parity and measurement time reported in this study suggests that the PHQ-9 may be used among pregnant women as a tool for screening antenatal depression as well as a measure of depression severity. Depression during pregnancy is not infrequent, causes psychological and functional maladjustment, and is often associated with a variety of social hardships such as cramped housing and the lack of a partner’s support (Kitamura, Shima, Sugawara, & Toda, 1996). Accurate identification of cases of depression during pregnancy can lead to better antenatal psychological care by midwives.
Precise identification of cases of antenatal depression may lead to early intervention by perinatal mental health professionals. Because drug therapy carries a possible risk of foetal malformation, psychotherapies are the main therapeutic approach. These include interpersonal therapy (Stuart & Koleva, 2014). Given the importance of the pregnant woman’s partner, partner-assisted psychotherapy may also be recommended (Brandon et al., 2012).
Our study was not without limitations. First, the participants were pregnant outpatients, who were unlikely to have serious psychiatric disorders. Second, the sample size was relatively small. Third, our sample consisted of women at 10 to 13 weeks’ gestational age; the study should be replicated among women in the later stages of pregnancy. Finally, the results should have been validated with a structured diagnostic interview.
Despite these drawbacks, we think that the PHQ-9 is a promising tool as a measure of depression severity in pregnancy.
Acknowledgements
We are grateful to all of the participants and to the Japanese Red Cross Medical Centre, Endou Ladies Clinic, Kubonoya Women’s Hospital, Tsuchiya Obstetric & Gynaecology Clinic, Aiiku Hospital, and Nakae Obstetric & Gynaecology Clinic.
Author Contributions
MM and TK set up the research design. MM, AH, and MW collected data. MW, AH and TK analysed data. MW, AH and TK wrote the manuscript.
Data Availability
The datasets used and/or analysed during the current study are available from the corresponding author upon request.