Diagnostic Value of Dual-Energy CT in Differentiating Malignant and Benign Thyroid Nodules: A Systematic Review and Meta-Analysis

Objectives: To evaluate the diagnostic performance of the quantitative iodine parameters, including Iodine Concentration (IC), Normalized Iodine Concentration (NIC), and λ HU , in distinguishing malignant and benign thyroid nodules. Methods: Relevant studies were searched from Web of Science, PubMed, Embase, Cochrane Library, China National Knowledge Infrastructure database and other complementary sources from inception to May 20, 2020. Study selection, data extraction, quality assessment, and data analyses were performed following the Cochrane standards and the PRISMA-DTA guideline. Results: Eight studies were included (595 patients with 737 thyroid nodules). The pooled sensitivity, specificity and summary diagnostic odds ratio of IC were 79% (95% CI: 69% - 86%), 76% (95% CI: 65% - 84%) and 11 (95% CI: 5 - 27) respectively; those of NIC were 78% (95% CI: 70% - 85%), 80% (95% CI: 74% - 85%) and 15 (95% CI: 9 - 24) respectively; those of λ HU were 80% (95% CI: 71% - 87%), 77% (95% CI: 70% - 83%) and 14 (95% CI: 8 - 24) respectively. Conclusion: DECT can be a potential evaluation tool for thyroid nodules. The NIC may be the most sensitive iodine parameter and could be comparable between different DECT machines in thyroid nodule assessment.


Introduction
Thyroid nodules are a common clinical finding, with a reported prevalence of 20%-70% in the general population [1] [2] [3]. Autopsy studies show that thyroid nodules are present in about 50% of the population [1] [4] [5]. Recent statistics show that the incidence of thyroid diseases is rising rapidly in many countries, while mortality remains relatively constant [2] [3] [6].
Currently, cytology is the gold standard for diagnosing thyroid nodules, but ultrasound-guided fine-needle aspiration (FNA) is invasive and carries certain risks and limitations [7] [8]. Although most complications of FNA biopsy, including perithyroidal hemorrhage and parenchymal edema, are reported to be mild and to resolve within a few hours, the discomfort and the risk of tumor metastasis during aspiration should not be ignored [7] [8]. A non-invasive examination could reduce pain and discomfort and avoid the possibility of cancer seeding along the needle track. It may also relieve patients' psychological burden by painlessly establishing a relatively low probability of malignancy, which may benefit disease management and clinical outcomes [9].
Conventional gray-scale ultrasonography (US) is widely used to detect thyroid nodules. It can roughly distinguish benign from malignant solid thyroid nodules based on morphologic features such as nodule shape, margin, size, and calcification [10]. However, the relatively low interobserver reliability of US and the lack of a convenient lexicon for characterizing thyroid nodules cannot be overlooked. US elastography, a more recent advance in US, might not be an ideal tool for differentiating the types of thyroid nodules either [11]. Single-energy computed tomography (SECT) can evaluate the size of nodules, but the correlation between nodule size and the nature of the nodule is unclear [12].
With recent advances in dual-energy CT (DECT), dual-energy imaging is burgeoning in clinical practice. DECT provides information on material decomposition by simultaneously acquiring images at two different energy levels in one scan. It can quantify intrinsic attenuation related to different atomic numbers and tissue density in specific regions of interest (ROI) [13]-[18]. Evaluating thyroid nodules using DECT is a novel non-invasive technique. It can provide a quantitative measurement of iodine concentration, which may be related to the pathophysiological status of thyroid nodules [19] [20].
Several studies have attempted to evaluate this application [21] [22] [23]. However, because different techniques were used, the value of the relevant quantitative iodine parameters and their diagnostic accuracy have not been determined and are worth exploring. Therefore, we conducted a meta-analysis to evaluate the diagnostic value of the quantitative DECT parameters in distinguishing malignant and benign thyroid nodules.

Search Strategy
Based on the "PICO principle" of evidence-based medicine, we constructed a question that defines population, intervention, comparison, and outcome: in patients with thyroid nodules, can quantitative DECT iodine parameters distinguish benign from malignant thyroid nodules as effectively as cytology? We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies (PRISMA-DTA) guidelines to conduct our meta-analysis [24] [25]. We used the following keywords to set

Inclusion and Exclusion Criteria
Inclusion criteria: 1) the study assessed the use of dual-energy CT in evaluating thyroid nodules and reported at least one quantitative parameter, including iodine concentration, normalized iodine concentration, and slope of the spectrum curve; 2) a proper reference standard was used: malignant thyroid nodules had to be pathologically confirmed, while benign thyroid nodules could alternatively be confirmed by imaging follow-up of more than 12 months; 3) sufficient data were available to construct a 2 × 2 contingency table to calculate diagnostic accuracy; 4) the imaging data were acquired preoperatively or before treatment (such as chemotherapy).
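The 2 × 2 contingency table required by criterion 3 can be illustrated with a minimal sketch. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any included study; the sketch only shows how sensitivity and specificity follow from the four cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index test result against the reference standard for one study."""
    tp: int  # malignant nodules classified as malignant
    fp: int  # benign nodules classified as malignant
    fn: int  # malignant nodules classified as benign
    tn: int  # benign nodules classified as benign

    def sensitivity(self) -> float:
        # Proportion of malignant nodules correctly identified.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of benign nodules correctly identified.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=40, fp=10, fn=10, tn=40)
print(f"sensitivity = {example.sensitivity():.2f}")
print(f"specificity = {example.specificity():.2f}")
```

A study that does not report enough information to fill all four cells cannot contribute to the pooled estimates, which is why this criterion is applied at the selection stage.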

Data Extraction
Two reviewers independently selected and extracted the data related to the study in the form of a data table. The data included first author, year of publication, country of origin, and research type, the patient's data (the number of cases, the sex ratio, and the average age of patients), the nodules data (the pathological type of nodules, number, nodule size), the mean and standard deviation of re-

Assessment of Study Quality
The quality of the included studies was evaluated according to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [26]. Each study was evaluated on the basis of 14 criteria. Because total scores may themselves introduce bias, we did not calculate scores but present the methodological quality of each study in a summary chart [27]. Disputes over the quality assessment results were resolved by discussion with another senior clinician or statistician until consensus was reached [28]. We chose Begg's test as a quantitative indicator of publication bias, with p < 0.05 indicating potential publication bias. We also tried to minimize publication bias through comprehensive search strategies.

Statistical Analysis
The primary purpose of this systematic review was to evaluate the diagnostic capability of the DECT quantitative parameters by calculating pooled sensitivity, pooled specificity, and the summary diagnostic odds ratio (SDOR). We used a bivariate model to create summary receiver operating characteristic (SROC) curves for each group. The diagnostic parameters of benign and malignant nodules were estimated if the data showed good homogeneity. We also attempted to explore the sources of heterogeneity.

Test of Heterogeneity
In diagnostic meta-analysis, the sources of heterogeneity mainly include differences in diagnostic threshold, random error, publication bias, differences in clinical data (sex and age, pathological type of nodules), and methodological differences. Cochran's Q test and Higgins' I² statistic were used to evaluate heterogeneity among the studies. p < 0.05 in the Q test indicated that the heterogeneity was greater than would be expected from random error alone. The I² statistic was used to describe quantitatively the percentage of the overall variation attributable to inter-study heterogeneity; I² > 50% suggests heterogeneity and may require further evaluation of its source [29].
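As a minimal sketch of how the two statistics relate, the function below computes Cochran's Q with inverse-variance weights and derives I² from it as (Q - df) / Q. The function name and the input values are hypothetical; the sketch is not the software routine used in this analysis.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, I2 in percent) for per-study effects and their variances."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical log diagnostic odds ratios and variances, for illustration only.
q, i2 = cochran_q_and_i2(effects=[0.0, 2.0, 4.0], variances=[0.5, 0.5, 0.5])
print(f"Q = {q:.1f}, I2 = {i2:.1f}%")
```

With these made-up inputs I² is well above 50%, the level at which the text suggests looking for a source of heterogeneity.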

Threshold Effect
The threshold effect arises when different diagnostic studies select different diagnostic cut-offs, which may lead to a correlation between the log (sensitivity) and log (1-specificity) across studies. The correlation was analyzed using Spearman's correlation test. A strong positive Spearman correlation coefficient (p < 0.05) may indicate a threshold effect.

Statistical Model
Typically, the effect size is calculated with a random-effects or fixed-effects model, chosen according to the heterogeneity. However, the heterogeneity of diagnostic tests is usually greater than that of interventional tests because of patient diversity and environmental differences, so the random-effects model should be used to calculate the effect size [31]. The recommended bivariate mixed-effects regression model was used to calculate the combined sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, SDOR, and area under the curve (AUC), each with its 95% confidence interval (95% CI) [32].
The diagnostic effectiveness was calculated using the standardized mean difference (SMD), based on the consistency of parameters, measurement methods, and scanning phase.
Meta-regression was used to analyze the influence of covariates on the SDOR, including study design type, country of origin, DECT manufacturer, contrast media concentration, contrast media flow rate, and scanning phase. p < 0.05 was selected as the criterion for statistical significance. Software used for data analysis and statistics included Stata 14 (StataCorp LP, College Station, TX) and Review
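For orientation, the relationship between these summary measures can be sketched from a single pair of sensitivity and specificity values. This is a back-of-the-envelope calculation, not the bivariate mixed-effects model, which estimates the quantities jointly and can therefore give slightly different results. The function name is hypothetical; the inputs are the pooled NIC point estimates reported in the abstract.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) from one sensitivity/specificity pair."""
    lr_pos = sensitivity / (1.0 - specificity)   # positive likelihood ratio
    lr_neg = (1.0 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                        # diagnostic odds ratio
    return lr_pos, lr_neg, dor


# Pooled NIC point estimates from the abstract: sensitivity 78%, specificity 80%.
lr_pos, lr_neg, dor = likelihood_ratios(0.78, 0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
```

The diagnostic odds ratio obtained this way is close to, but not identical with, the model-based SDOR of 15 reported for NIC.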
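The text does not state which SMD estimator was applied. As an assumption for illustration, the sketch below computes Cohen's d with a pooled standard deviation and the small-sample correction that yields Hedges' g. The function name and all input values are hypothetical.

```python
import math


def standardized_mean_difference(mean1, sd1, n1, mean2, sd2, n2):
    """Return (Cohen's d, Hedges' g) for two independent groups."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (mean1 - mean2) / pooled_sd
    # Small-sample correction factor (assumed here; not stated in the text).
    j = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d, d * j


# Hypothetical parameter values in benign (group 1) and malignant (group 2) nodules.
d, g = standardized_mean_difference(10.0, 2.0, 20, 8.0, 2.0, 20)
print(f"Cohen's d = {d:.3f}, Hedges' g = {g:.3f}")
```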

Characteristics of Studies and Quality Assessment
Basic information on the included studies is shown in Table 1; diagnostic performance data are given in Supplementary Table S1.
Seven of the included studies were from China, and one was from South Korea.
Three studies had a prospective design, and the rest were retrospective. Half of the studies were published after 2016 (range 2013-2019). All eight studies recorded measurements of one or more DECT parameters: five reported IC results, six reported NIC results, and four reported λ HU .
In this systematic review and meta-analysis, a total of 595 patients (32% males and 68% females, age 50.64 ± 11.46 years on average) were included, with an  with adenoma (n = 21, 6.17%) and granuloma (n = 3, 0.80%). Twenty-three cases (6.17%) were follow-up cases or had unknown pathological results (Table 2). The quality of the included literature was assessed with the QUADAS-2 tool (Figure 2, Figure 3). Although the included studies selected patients carefully, there was still an unclear risk of bias owing to the variety of patients with thyroid nodules (e.g., some studies excluded pregnant patients or patients aged under 20). Regarding the reference standard, one study used long-term follow-up results as one of the gold standards, which may pose a hidden risk of bias in determining whether a nodule is benign or malignant [22]. Because most studies used cytological results as the gold standard, we consider the risk of bias to be low. In addition, two studies did not state whether the index test results were interpreted without knowledge of the reference standard [37] [38]. The PRISMA-DTA checklists (Table S2, Table S3) are provided in the Supplementary Materials.

Diagnostic Effectiveness
The pooled sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, DOR, and AUC corresponding to the three parameters are shown in Table 3. NIC has a specific threshold effect in the estimation of diagnostic accuracy (p

Prediction of Post-Test Probability
Based on our results, the post-test probability, calculated using the likelihood ratios of the NIC group, is shown in the Fagan nomogram [32]. Referring to the currently recorded incidence of malignant thyroid tumors, we assumed that the pre-test probability of malignant thyroid nodules is about 10%. If the result of the non-invasive DECT examination is positive, the probability of malignancy increases about threefold, from 10% to 31%. If the result is negative, the probability of malignancy decreases from 10% to 3% (Figure 7).
Figure 7. Fagan nomogram using the NIC technique to predict the post-test probability of malignant thyroid nodules in patients with thyroid nodules. As shown, a positive result increases the probability from 10% to 31%, while a negative result decreases the probability from 10% to 3%.
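The arithmetic behind a Fagan nomogram can be sketched as a conversion from probability to odds, multiplication by the likelihood ratio, and conversion back. The function name is hypothetical, and the likelihood ratios used here (3.9 and 0.275) are assumptions derived from the pooled NIC sensitivity (78%) and specificity (80%) rather than the model-based ratios behind Figure 7, so the positive result comes out near 30% instead of exactly 31%.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


PRE_TEST = 0.10  # assumed pre-test probability of malignancy, as in the text
after_positive = post_test_probability(PRE_TEST, 3.9)    # assumed LR+
after_negative = post_test_probability(PRE_TEST, 0.275)  # assumed LR-
print(f"after a positive test: {after_positive:.1%}")
print(f"after a negative test: {after_negative:.1%}")
```

The negative result lands at about 3%, in line with the figure; the small gap on the positive side reflects the difference between these derived ratios and the jointly estimated ones.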

Discussion
As far as we know, a few studies have attempted to explore the relationship between thyroid nodules and diverse quantitative parameters [12] [22] [23] [36] [39]. One previous approach with conventional single-energy CT was to use nodule size to distinguish benign from malignant thyroid nodules quantitatively. However, the correlation between the size of nodules and the nature of nodules is not ideal [12]. Also, for non-contrast DECT images, Tomita et al. found that it is not feasible to differentiate between benign and malignant thyroid nodules using material decomposition and attenuation slopes [39]. Interestingly, our meta-analysis found that quantitative DECT parameters have good diagnostic accuracy in evaluating benign and malignant thyroid nodules, especially on contrast-enhanced images.
IC, NIC, and λ HU are the parameters for which relatively sufficient data are currently available. The diagnostic ability of these three parameters seems similar. The diagnostic effi- noma, while the follicular structure and iodine-uptake ability were normal in nodular goiters and adenomas [21]. Studies have also demonstrated that the expression of thyrocyte markers such as sodium-iodide symporter (NIS) and thyroglobulin (TG) is lower in thyroid cancer than in benign lesions [41] [42] [43]. Among these, the decrease in follicular cells and in iodine transporters on the cell membrane might correlate closely with the dysfunction of iodine uptake.
Based on our estimation, the NIC values of malignant nodules were significantly lower on contrast-enhanced CT images than those of benign nodules (p < 0.01), which may support the dysfunction of iodine uptake in malignant thyroid nodules.
When an X-ray beam passes through a material, part of its photon energy is absorbed. The remaining energy is collected by the CT detectors and converted to a signal that represents the X-ray attenuation of the material, expressed in Hounsfield units (HU) [17] [44]. The spectrum curve is, essentially, the plot of HU against X-ray energy level (generally within the range of 40 to 140 keV). In some instances, two materials (e.g., calcium and iodine) can exhibit the same HU because of similar attenuation coefficients at a given energy level. However, a material's attenuation coefficient depends primarily on its mass density, its composition, and the energy of the photons interacting with it. Measuring the attenuation at a second energy level therefore reveals a change in HU that helps to differentiate between the two materials [17]. The slope of the spectrum curve (λ HU ) was used to describe the spectrum curve. A previous ex vivo study reported that the λ HU measured in benign nodules (nodular goiters and follicular adenomas) was positive, whereas in malignant nodules (papillary carcinoma) λ HU was negative [21]. In contrast, our included studies showed that all λ HU values were positive in the venous phase, which was inconsistent with the previous ex vivo findings. According to our analysis, the absolute value of λ HU was larger in benign nodules than in malignant nodules.
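The text does not give the formula used for λ HU . As an assumption for illustration, the sketch below takes the slope as the drop in HU between two monochromatic energy levels divided by the energy difference, with 40 and 100 keV as default levels. The function name, the energy levels, and the HU values are all hypothetical; individual studies may use other energy pairs.

```python
def spectral_slope(hu_low_kev, hu_high_kev, low_kev=40.0, high_kev=100.0):
    """Slope of the spectral HU curve between two energy levels (HU per keV).

    Positive when attenuation falls as energy rises, as it does for
    iodine-containing tissue.
    """
    if high_kev <= low_kev:
        raise ValueError("high_kev must be greater than low_kev")
    return (hu_low_kev - hu_high_kev) / (high_kev - low_kev)


# Hypothetical ROI measurements at 40 keV and 100 keV.
slope = spectral_slope(hu_low_kev=220.0, hu_high_kev=100.0)
print(f"lambda_HU = {slope:.2f} HU/keV")
```

Under this sign convention, a positive slope in the venous phase is what one would expect for iodine-enhanced tissue, whose attenuation falls steeply at higher energies.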
As a previous study stated, DECT data obtained with different DECT devices might show some heterogeneity, which requires caution when comparing studies [45]. Our comprehensive study contains three kinds of DECT devices: five dual-energy gemstone spectral CTs, one dual-layer spectral CT, and two second-generation dual-source dual-energy CTs. However, we found that the heterogeneity of NIC's specificity was not significant, and the heterogeneity of NIC's sensitivity was also acceptable, because it could be explained, to some extent, by the threshold effect. This evidence indicates that the measurement of NIC in the venous phase might be comparable between different CT machines.
Moreover, we did not perform the same estimation in the IC and λ HU groups because of their unexplained heterogeneity. The heterogeneity of the IC and λ HU groups may originate mainly not from the threshold effect but from other factors. For example, different imaging phases may cause the considerable heterogeneity shown in the mean difference of diagnostic efficacy (I² = 73% and I² = 95%, respectively). In the meta-regression analysis of the three parameters, however, the known covariates did not show a significant influence on the SDOR. One possibility is that the heterogeneity is low. It is more likely, however, that the small number of available studies makes it difficult to explain the heterogeneity with the known variables. Nevertheless, we should not ignore that the present analysis does not consider some important variables, such as age, type of nodules, size of nodules, the timing of the contrast injection, and the trigger threshold of the CT scan.

Conclusion
DECT has good diagnostic accuracy in thyroid nodule evaluation. The normalized iodine quantification technique is comparable between different dual-energy CT machines in thyroid nodule assessment. Quantitative DECT parameters may be a potential tool for evaluating thyroid nodules noninvasively, reducing the need for unnecessary examinations of nodules and the associated patient anxiety and inconvenience. Larger prospective studies are necessary to further optimize the diagnostic efficacy of quantitative DECT parameters.

Limitations
This study has some limitations. First, it has been pointed out that the incidence of thyroid cancer is increasing while thyroid nodules may be overdiagnosed, which could lead to overtreatment [6] [45]. We did not address whether extra diagnostic methods such as DECT should be added to the thyroid examination, though this remains a matter of concern. Second, the thyroid is radiation sensitive. At present, there are still some challenges in terms of radiation safety, though the radiation dose could be reduced by up to 50% in DECT [46]. Third, because the application of DECT is still developing and spreading, the available literature is limited, which leads to a relatively small number of studies meeting our inclusion criteria. We also did not estimate combined diagnostic thresholds because of the heterogeneity among studies. Fourth, the categories of thyroid nodules included in the study might need further subdivision. Although various pathological types of nodules were collected in this paper, no study related the parameters to specific types of thyroid nodules (e.g., nodules suspicious for malignancy on FNA) that might progress, for example through nodule enlargement or a tendency toward lymph node metastasis. It is also worth noting that iodine values may be difficult to measure accurately in micro thyroid lesions (maximum diameter < 10 mm) because of the partial volume effect in CT imaging. Of the included studies, only two contained nodules smaller than 10 mm [36] [37], which implies that the quantitative diagnosis of sub-centimeter nodules still requires further study. Fifth, in terms of data statistics, some data were provided as median and quartiles rather than mean and standard deviation [36]. We therefore estimated some data by combining the median and quartiles with the original sample size [47] [48], which may introduce errors into the estimated mean and standard deviation.
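The exact conversion formulas are not stated in the text. As an assumption for illustration, the sketch below uses one widely cited approximation for data reported as median, quartiles, and sample size: the mean is taken as the average of the three quartile values, and the standard deviation as the interquartile range divided by twice a normal quantile that depends on the sample size. The function name and the input values are hypothetical.

```python
from statistics import NormalDist


def mean_sd_from_quartiles(q1, median, q3, n):
    """Approximate (mean, SD) from median, quartiles, and sample size.

    Assumes roughly normally distributed data; skewed data will make
    both estimates less reliable.
    """
    mean = (q1 + median + q3) / 3.0
    # Expected standardized position of the upper quartile for sample size n.
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    sd = (q3 - q1) / (2.0 * z)
    return mean, sd


# Hypothetical summary statistics, for illustration only.
est_mean, est_sd = mean_sd_from_quartiles(q1=2.0, median=3.0, q3=4.0, n=100)
print(f"estimated mean = {est_mean:.2f}, estimated SD = {est_sd:.2f}")
```

Because the approximation relies on a normality assumption, it is one plausible route by which estimation error could enter the pooled mean differences.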

Supplementary Materials
The following are available online: full search strategy, Table S1: diagnostic performance of DECT parameters, Table S2: PRISMA-DTA checklist,

17. Study selection: Provide numbers of studies screened, assessed for eligibility, included in the review (and included in meta-analysis, if applicable) with reasons for exclusions at each stage, ideally with a flow diagram. (5, Figure 1)

18. Study characteristics: For each included study provide citations and present key characteristics including: a) participant characteristics (presentation, prior testing), b) clinical setting, c) study design, d) target condition definition, e) index test, f) reference standard, g) sample size, h) funding sources. (8-9, Table 1, Table 2)

19. Risk of bias and applicability: Present evaluation of risk of bias and concerns regarding applicability for each study. (5-7, Figure 1, Figure 2)

20. Results of individual studies: For each analysis in each study (e.g. unique combination of index test, reference standard, and positivity threshold) report 2 × 2 data (TP, FP, FN, TN) with estimates of diagnostic accuracy and confidence intervals, ideally with a forest or receiver operator characteristic (ROC) plot.

23. Additional analysis: Give results of additional analyses, if done (e.g., sensitivity or subgroup analyses, meta-regression; analysis of index test: failure rates, proportion of inconclusive results, adverse events).