Newborn Hip Screenings at 4 to 8 Weeks Are Optimal in Predicting Referral and Treatment Outcomes: A Retrospective Review

Optimal DDH screening timing and whether adding risk profiles could aid in predicting treatment outcome were investigated. Risk factors were employed to supplement ultrasound findings in flagging cases for follow-up. Concordance between initial screening results and harness treatment outcomes was compared across screening ages and screening protocols. Using clinical decision to supplement ultrasound screening accurately flagged all 12 treated DDH cases at the initial visit. Clinical decision correctly identified cases that would otherwise have been missed (n = 2). However, doing so increased the rate of false positive cases at all time points of initial screening. Initial screens were more accurate for predicting treatment outcomes when done after 28 days, whether using ultrasound only [≤28 days (88.1%) vs. 29 - 56 days (98.5%), OR = 7.16, p < 0.001] or ultrasound with clinical decision [≤28 days (86.4%) vs. 29 - 56 days (95.7%), OR = 3.00, p < 0.001]. In contrast, screening after 56 days yielded no marginal improvement in accuracy compared to screens done between 29 - 56 days, regardless of the screening protocol employed. Two important trade-offs emerged. First, when choosing timing of initial screening, optimal accuracy and in determining cases that require further follow-up evaluation.

ficient predictive value. Although physical exams and risk factors were useful for targeted screening of more severe hip types, the use of risk factors was not supported as an effective tool for selective screening [10]. Universal screening programs have gained support and can reduce the number of cases referred to orthopedic specialists and provide early detection of late-presenting cases [11] [12]. In Austria, Thaller et al. [13] found that universal screening became cost-effective compared to selective screening when long-term healthcare costs and falsely identified screened cases were considered [14]. Despite evidence showing that universal US is an effective screening method [15], means of improving screening accuracy and reducing unnecessary follow-up should be considered.
Screening timing is an important factor in accurately assessing the risk of developing DDH [16]. The UK's National Health Service Newborn and Infant Physical Examination (NIPE) and American Academy of Orthopedic Physicians both suggest that targeted screening be employed to detect high-risk DDH cases and that screening should occur between 6 and 8 weeks [17], a recommendation largely based on expert opinion [18]. Concerns about delayed screening are merited, with screening later than three months possibly reducing reliability and increasing missed cases [19]. Moreover, non-invasive treatment is not typically possible later than 6 months due to advanced hip dysplasia [4]. Severely abnormal hips may even have worse treatment outcomes when treatment is initiated after 8 weeks [20], yet it has been suggested that treatment of hips be initiated between 8 and 12 weeks of age [21].
Ideally, initial screening age should balance between peak screening accuracy and treatment initiation.
Since optimal screening time, risk profiling, and their performance in screening for DDH are not sufficiently understood, further research is needed to support the current recommended guidelines. Our study aimed to compare screening performance when 1) using US findings only, and 2) using risk factors to supplement US findings, and to compare performance at different screening ages in order to find an optimal timing for initiating DDH ultrasound screening.

Sample
Data were retrospectively reviewed from a DDH registry database and were retrieved from all screening visits conducted between January 1, 2017 and December 31, 2017. Our sample consisted of all newborns attending one of nine medical centers or clinics for either self-referred or referred post-natal check-ups. Referred cases may have been referred to one of the nine participating medical centers/clinics due to indications of DDH risk factors or simply for a scheduled check-up. Any cases that had a previous US performed at another site, or that were older than 6 months of age at initial screen, were excluded from our sample. The nine participating screening centers make up the Taiwanese Screening and Audit System for Developmental Dysplasia of the Hip (TSAS-DDH). The TSAS-DDH is a network of pediatric orthopedic specialists, general pediatric physicians, obstetricians, radiological technicians, pediatric nurses, and public health professionals that performs universal ultrasound screening for DDH, holds training, validates data collection, and analyzes data, in order to develop an evidence-based screening program that can improve DDH screening outcomes in Taiwan and internationally. Informed consent was obtained from parents/guardians for newborns to participate in the screening registry. The database saw 3018 newborns recruited into the US screening program. Ethics approval was granted by the Institutional Review Board of Chang Gung Medical Foundation (IRB#: 201800670B0).

Hip Ultrasound
Hip US were assessed using Graf's classification technique, measuring the alpha and beta angles from a standard coronal section of the hip as previously described [22]. The resulting hip types are coded as follows: Ia, Ib, IIa, IIb, IIc, D, III, and IV. Exact alpha and beta angle measurements and hip types have been published elsewhere [23]. Hip examinations were performed by trained DDH US operators and newborns were given a Graf classification code for each hip. A single code was later assigned to each newborn based on the hip with the most severe Graf type.
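The per-newborn coding rule described above can be sketched as follows. This is only an illustration, assuming that the order in which the Graf types are listed in the text runs from least to most severe; the names GRAF_ORDER and most_severe_graf are hypothetical and do not come from the registry software.

```python
# Graf types in the order given in the text, assumed here to run
# from least severe (Ia) to most severe (IV).
GRAF_ORDER = ["Ia", "Ib", "IIa", "IIb", "IIc", "D", "III", "IV"]


def most_severe_graf(left_hip: str, right_hip: str) -> str:
    """Return the more severe of the two hip codes (hypothetical helper).

    Raises ValueError if either code is not a known Graf type.
    """
    return max(left_hip, right_hip, key=GRAF_ORDER.index)


# Example: a newborn with one Ia hip and one IIc hip is coded as IIc.
print(most_severe_graf("Ia", "IIc"))
```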

Screening Protocol
A universal US hip screening protocol with preliminary PE test and risk factors was used. Cases that had a positive final US screen and were unreachable were deemed to be lost to follow-up.

Variables
Timing of initial screening was reported in 3 time intervals (≤28 days, 29 - 56 days, >56 days), because previous studies have shown that screening after 28 days can significantly impact screening accuracy [26], while screening after 8 weeks has been shown to have poorer surgical outcomes [20]. Only initial scans performed before 6 months were included, since performing surgery after 6 months is usually less effective, or not possible [4]. Descriptive data collected included sex (male and female), gestational number (singleton, multiple), term of birth (preterm or term), and screening physician's background (general practitioner and pediatric orthopedic specialist). The primary outcome was whether Pavlik harness treatment was required (Yes, No); the initial screen was then coded as a binary variable (accurate or inaccurate) according to its concordance with this outcome. Initial screening outcomes were categorized based on Graf hip classification and presence of risk factors.
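The two derived variables described above, the screening age interval and the accurate/inaccurate coding, could be computed along these lines. The function names and the ASCII interval labels are illustrative assumptions, not the study's actual variable names.

```python
def screening_age_group(age_days: int) -> str:
    """Map age at initial screen (in days) to one of the three intervals."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 28:
        return "<=28 days"
    if age_days <= 56:
        return "29-56 days"
    return ">56 days"


def screen_is_accurate(flagged_at_initial_screen: bool,
                       harness_required: bool) -> bool:
    """An initial screen is 'accurate' when it agrees with the treatment outcome."""
    return flagged_at_initial_screen == harness_required


# Example: a newborn screened at 35 days, flagged for follow-up,
# who never needed a harness, counts as an inaccurate (false positive) screen.
print(screening_age_group(35), screen_is_accurate(True, False))
```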

Statistical Analysis
Initial screening outcome and final screening outcomes were summarized by group percentages and counts. Descriptive data and screening outcomes were categorized by initial screening age group and compared by Fisher's Exact Test (χ²) of independence. The rate of harness treatment was further compared by initial Graf classification group and age at screening. We present comparisons of the simulated US Only protocol and our actual US with CD protocol. Screening performance was given by reporting the counts and column percentages of false positive (FP), false negative (FN), true positive (TP) and true negative (TN) cases, and accuracy. Univariate comparison by initial screening age was done for each using Fisher's Exact Test (χ²). Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) were also reported to further demonstrate screening performance.
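For readers less familiar with these measures, the standard definitions in terms of TP, FP, FN and TN can be sketched as below. The function name and the counts in the example are hypothetical and are not taken from the study tables; undefined ratios (zero denominators) are returned as NaN.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard 2x2 screening performance measures (illustrative helper)."""

    def ratio(num: float, den: float) -> float:
        # Return NaN rather than raising when a measure is undefined.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "plr": ratio(sensitivity, 1 - specificity),
        "nlr": ratio(1 - sensitivity, specificity),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only.
for name, value in screening_metrics(tp=8, fp=10, fn=2, tn=80).items():
    print(f"{name}: {value:.3f}")
```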
A multivariate logistic regression analysis was conducted to estimate the likelihood of an initial screening outcome accurately predicting whether harness treatment was required. The independent variable of interest was timing of initial screening. We used a pairwise logistic regression model which contrasted each pair of time intervals. All models controlled for the following covariates: gender, multiple fetal numbers, premature birth, and physician's background. Odds ratios (OR) and 95% confidence intervals (95% CI) were reported. Moreover, a Z-test of two proportions was performed to compare screening protocols at each age interval.
Finally, Receiver Operating Characteristic (ROC) curves were developed and the area under the curve (AUC), 95% confidence interval (95% CI) and p-values (p) were reported for performance of initial screens in predicting a treatment outcome, stratified by age at initial screen, for each screening protocol. The false negative rate was also calculated at each age and for both screening protocols.
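A Z-test of two proportions, as mentioned above, can be sketched with the Python standard library. This minimal version assumes the common pooled-variance form with two independent samples and a two-sided p-value; the exact variant computed in SPSS is not stated in the text, and the function name and example counts are hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple:
    """Pooled two-sided Z-test for the difference of two proportions.

    x1/n1 and x2/n2 are the counts of 'accurate' screens out of all
    screens under each protocol (illustrative interpretation).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(90, 100, 80, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```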
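As a sketch of what the AUC measures, it can be computed as the probability that a randomly chosen treated case receives a higher screening score than a randomly chosen untreated case, with ties counted as one half. For a binary screen (flagged or not flagged), this reduces to the mean of sensitivity and specificity. The function below is an illustrative standard-library implementation, not the SPSS procedure used in the study, and the example data are hypothetical.

```python
def auc_from_scores(scores, labels) -> float:
    """AUC via pairwise comparison of positive and negative cases.

    scores: screening result per case (e.g. 1 = flagged, 0 = not flagged)
    labels: 1 if harness treatment was required, otherwise 0
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical binary screen: sensitivity 1/2, specificity 2/3.
labels = [1, 1, 0, 0, 0]
scores = [1, 0, 0, 0, 1]
print(auc_from_scores(scores, labels))
```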
An a priori two-tailed significance cut-off of p < 0.05 was established for all analyses. Analysis was conducted using SPSS version 22.0 software.

Results
Descriptive data were compared for all cases by initial screening age (Table 1). The majority (n = 1941) were screened within 28 days of age, while 722 were screened between 29 days and 56 days of age, and 355 were screened after 56 days of age. Physician's background (χ² = 133.608, p < 0.001), initial screen outcome (χ² = 126.297, p < 0.001) and screening program outcome (χ² = 15.095, p < 0.01) all differed significantly by screening age. In contrast, newborn sex, gestational number, and preterm birth did not differ significantly between screening ages.
Among the 2982 cases that were not lost to follow-up, the rate of treatment among Graf classification groups and by age at initial screening were compared. Screening performance for both US Only and US with CD screening protocols is described in Table 3. When using US Only, FP decreased significantly from 11.9%, to 1.5%, and 0.3%, as screening age increased (≤28 days, 29 - 56 days, >56 days, respectively). Two FN cases were found after 56 days, and the TN rate increased with age (87.7%, 98.2%, and 99.2%, respectively). Sensitivity was perfect except in the last screening age group due to no cases being correctly identified after 56 days, while specificity improved from 88.0%, 98.5%, and ... Conversely, PPV and PLR both increased as the screening age increased. Accuracy also followed a similar trend, increasing with each age group (86.1%, 95.4%, and 97.7%, respectively).
Screening accuracy was further analyzed with a multivariate pairwise logistic regression model of concordance between the initial screen and the final treatment outcome, stratified by age group at initial screening, when using US Only (Figure 2(a)) and when using US with CD (Figure 2(b)). When analyzing US Only, significant increases in accuracy were detected between ≤28 days of age ... Lastly, results from ROC analysis (Figures 3(a) ...

Discussion
Using a US screening protocol with PE and risk factor profiles led to all treatment cases being accurately detected at first screening. Most cases were identified before 28 days (n = 8), with two additional cases screened between 29 - 56 days, and two cases screened after 56 days. Using US Only to select cases for follow-up would have resulted in two cases being missed that were correctly identified with the US with CD protocol. Although accuracy was higher when screening after 28 days compared to screening before 28 days, screening later than 56 days yielded no marginal improvement in accuracy. Although employing a US Only protocol resulted in better accuracy in the 29 - 56 day range, US with CD was able to correctly identify all cases that required harness treatment upon first screening.
We found that supplementing US findings with risk factor profiles to select cases that required further follow-up resulted in all 12 cases eventually requiring harness treatment being correctly identified upon first screening visit. Current evidence for selective screening protocols has indicated that cases with breech delivery, family history of DDH or clinical hip instability detected from PE are at risk of developing late DDH [19] [24], on which we based our screening protocol to guide CD by a physician. However, rather than using risk for targeting US screening, the risk screening protocol was used as an additional decision-making tool to select negative screened cases that were at risk of late developing DDH. Risk profiling has been shown to be moderately supported by evidence [27], and has also been promoted as a tool for detecting late developing DDH cases [28]. In our application of the risk screening protocol, none of the DDH treatment cases were missed upon first evaluation, which supports conservatively screening out cases only if they have a negative Graf classification, have a negative PE, and are without risk factors for DDH. Further research is needed to confirm our findings and to develop a consensus on a risk factor profile that could supplement US screening and improve accuracy of detecting cases requiring treatment.
US screens done after 28 days were found to be effective in improving screening accuracy. Past studies have found a similar trend of better screening performance with delayed screening protocols [16] [26] [29]. Delayed screening has gained support due to the fact that dysplastic hips are likely to resolve naturally without intervention [30]. However, we found that screening accuracy showed no marginal improvement when screening after 56 days compared to screening between 29 - 56 days. Our findings overlap with the screens between 6 - 8 weeks suggested by the NIPE and the AAOP [17] [18]. Roovers et al. [12] have suggested that screening between 2 - 3 months may be beneficial for reducing overtreatment when screening with US. Furthermore, treatment for DDH after 8 weeks is shown to have poorer screening and treatment outcomes [4] [20], since newborn hips undergo maturing (muscle tightness increases and capsular laxity decreases) in the 8 - 12 weeks of age range [31]. In light of the possible risks of delaying screening too much, and of our findings of reduced marginal improvements in screening accuracy after 56 days, screening should only be delayed insofar as accuracy is improved by the delay and the effects on screening and surgery performance are limited.
Implementing risk screening into the follow-up clinical decision making revealed an important difference in screening performance. Screening performance was statistically superior when relying on ultrasound Graf classification findings alone to decide on cases requiring follow-up. Specificity and accuracy were higher in all age groups when assessed in the US Only protocol. Screening with risk profiling led to more screened cases being unnecessarily followed up at all screening age time points. Sensitivity was found to be higher by Roovers et al. [12] and Mace et al. [32]. Although earlier screens in both screening strategies had lower specificity than Mace et al. (99.8% to 99.9%), our delayed screenings (>28 days of age) had a similar performance. It is interesting to note that the mean age for cases from Mace et al. [32] was around 1 week, in contrast with our findings of higher accuracy at later screening ages. The discrepancy in our screening performance was likely due to relying on a more conservative screening approach, leading to a higher number of cases being unnecessarily followed up and a decrease in accuracy, but with an ability to accurately detect upon first evaluation all cases that required harness treatment. Roovers et al. [12] also found that when relying on delayed universal ultrasound Graf classification screening, missed positive cases still persisted (6/1000, 11.5% of DDH cases).
Despite this difference when using the US with CD protocol, screening accuracy was not severely lower: a significant decrease in accuracy was observed in the 29 - 56 day screening range, while the other age groups were not significantly different. Clinical application is important to consider, since cases requiring surgery are potentially the most costly "inaccurately" screened cases due to subsequent treatment complexity and increased costs [14]. A clear trade-off emerged when implementing CD into US screening decision making, where using CD decreased accuracy but correctly identified all cases requiring harness treatment. Further research with a larger number of false negative cases is needed to confirm our findings on this observed trade-off.
Our study had some limitations. First, the design was observational and selection for screening age was not randomized. Higher risk cases may have presented for earlier screenings than lower risk cases, which may have introduced bias into our sample selection. A further study comparing consecutive cases receiving repeated, blinded measurements at different screening time points may be needed. Second, although our multi-center healthcare units were varied in settings, the sample chosen may not be generalizable to the general population. Additionally, 36 cases were unreachable for follow-up despite receiving an abnormal hip classification on their last screening visit. Most of these cases (n = 35) were Type IIa/IIb hips at initial screening. Cases may have chosen to attend another treatment site outside of our network. Thus, the DDH treatment rate may have been underestimated in our sample. However, it is more likely that the cases resolved naturally with maturity, since the overall DDH incidence in our sample was 0.4%, similar to the incidence in the general population [3] [11] [33], and the likelihood that a case was present in the 36 lost cases was low (0.8%). Third, we compared a simulated screening protocol, which used the same population but removed CD from the study protocol. Lastly, although we found significant differences between screening protocols, only 2 cases were missed (after 56 days), which may have been attributable to physician error. The majority of early screens were performed by general pediatric practitioners. Previous studies have shown that screening operator experience plays an important role in screening accuracy, and that screening early leads to more variability in screening performance [5] and higher rates of over-treatment [6]. We controlled for physician experience and found that regardless of operator experience, delayed screening may improve screening performance.
Thus, further research with sufficient sample size may be needed to confirm this finding.
With the ongoing debate between targeted and universal ultrasound screening, a universal screening approach which utilizes both delayed screening and risk profiling to support ultrasound assessment was supported by our findings. Two important trade-offs were identified. First, the timing of initial DDH screening should consider both screening performance and the potential impact on surgical outcomes of screening too late. Second, choosing whether to employ risk factors to aid in screening should consider both screening performance and the cost of missing cases that require harness treatment. Our findings reveal a window for optimal DDH screening in newborns between 4 - 8 weeks (29 to 56 days), while employing risk profiles to guide follow-up decision making. Employing this screening method optimizes screening accuracy while preventing potential harm from late detection and from missed cases requiring harness treatment.