Vol.2, No.7, 641-651 (2010)
doi:10.4236/health.2010.27098
Copyright © 2010 SciRes. Openly accessible at http://www.scirp.org/journal/HEALTH/
Health
Statistical models for predicting number of involved
nodes in breast cancer patients
Alok Kumar Dwivedi1*, Sada Nand Dwivedi2, Suryanarayana Deo3, Rakesh Shukla1,
Elizabeth Kopras4
1Center for Biostatistical Services, Department of Environmental Health, College of Medicine, University of Cincinnati, Cincinnati,
USA; *Corresponding Author: alok_bhu1@yahoo.co.in
2Department of Biostatistics, All India Institute of Medical Sciences, New Delhi, India
3Department of Surgical Oncology, All India Institute of Medical Sciences, New Delhi, India
4Department of Environmental Health, College of Medicine, University of Cincinnati, Cincinnati, USA
Received 12 March 2010; revised 8 April 2010; accepted 10 April 2010.
ABSTRACT
Clinicians need to predict the number of invol-
ved nodes in breast cancer patients in order to
ascertain severity, prognosis, and design sub-
sequent treatment. The distribution of involved
nodes often displays over-dispersion, that is, larger
variability than expected. Until now, the negative
binomial model has been used to describe
this distribution assuming that over-dispersion
is only due to unobserved heterogeneity. The
distribution of involved nodes contains a large
proportion of excess zeros (negative nodes),
which can lead to over-dispersion. In this situa-
tion, alternative models may better account for
over-dispersion due to excess zeros. This study
examines data from 1152 patients who under-
went axillary dissections in a tertiary hospital in
India during January 1993-January 2005. We fit
and compare various count models to test
model abilities to predict the number of involved
nodes. We also argue for using zero inflated
models in such populations where all the excess
zeros come from those who are at some
risk of the outcome of interest. The negative
binomial regression model fits the data better
than the Poisson, zero hurdle/inflated Poisson
regression models. However, zero hurdle/inflated
negative binomial regression models predicted
the number of involved nodes much more accu-
rately than the negative binomial model. This
suggests that the number of involved nodes
displays excess variability not only due to un-
observed heterogeneity but also due to excess
negative nodes in the data set. In this analysis,
only skin changes and primary site were asso-
ciated with negative nodes whereas parity, skin
changes, primary site and size of tumor were
associated with a greater number of involved
nodes. In case of near equal performances, the
zero inflated negative binomial model should be
preferred over the hurdle model in describing
the nodal frequency because it provides an es-
timate of negative nodes that are at “high-risk”
of nodal involvement.
Keywords: Nodal Involvement; Count Models;
Breast Cancer
1. INTRODUCTION
Accurate prediction of the number of involved nodes in
breast cancer patients helps in grading the severity of
disease, avoiding extensive axillary dissection, and
making treatment decisions such as the use of neoadjuvant
chemotherapy [1,2]. Many studies have been performed
to predict nodal status in breast cancer patients.
Most of them merely predict the presence/absence of
involved nodes rather than the number of involved nodes
[3]. Until now, only two studies have tried to predict the
number of involved nodes in breast cancer patients.
Guern and Vinh-Hung [3] found that a negative binomial
model describes the number of nodal involvement better
than the Poisson model due to excess variability, a con-
dition called over-dispersion. Another study showed that
the negative binomial model provides a better fit as com-
pared to the Poisson model for the total number of in-
volved nodes in breast cancer patients in a meta-analysis
[4]. These studies used a negative binomial model, which
posits that the over-dispersion arises entirely from
unobserved heterogeneity and/or nodal clustering.
A. K. Dwivedi et al. / HEALTH 2 (2010) 641-651
However, count data often involve over-dispersion not
only due to unobserved heterogeneity and/or clustering
but also due to the preponderance of zero frequency
(negative node in the case of cancer) [5]. Consequently,
the nominal Poisson or the negative binomial distribu-
tions may not satisfactorily account for excess variability
if this variability is indeed due to excess zeros. In such
situations, these models are likely to underestimate
the probability of negative node status and may provide
misleading results. Zero hurdle or zero inflated regression
models can be used to improve prediction in
situations with excess zeros.
In count data, the observed zeros can be either struc-
tural zeros (e.g., the subject is at no risk of the event of
interest) or sampling zeros (e.g., the subject is indeed at
some risk of the event of interest). It has been suggested
that zero hurdle models are more appropriate in case of
excessive sampling zeros while zero inflated models
should be preferred in cases of mixtures of zeros i.e.,
involvement of both types of zeros [6]. In breast cancer,
all the patients are indeed at some risk of having nodal
involvement and thus all zeros are strictly sampling ze-
ros. Thus, according to the prevailing wisdom, zero hur-
dle models could be employed to predict the nodal fre-
quency among breast cancer patients.
In epidemiologic studies, count data generally involve
zeros from subjects who are at some risk of the outcome of
interest. In such circumstances, there exist alternative ways
to conceptualize the so-called structural zeros and sampling zeros.
Using the epidemiological parlance, we can conceptual-
ize zeros in terms of disease on-set and disease progres-
sion. In breast cancer patients, a lack of nodal involve-
ment (observed zero) may be because the cancer is de-
tected early enough in the disease progression (closer to
the time of disease onset) or the cancer itself is of slow
progression and/or absence of risk factors for high rate
of disease progression. These kinds of zeros may be
identified as true or structural zeros. The rest of the zeros
may be observed in the presence of various risk factors
leading up to a high rate of disease progression. These
latter types of zeros can be identified as false or sam-
pling zeros. Thus, within the framework of zero inflated
models, excess zeros can be modeled as a mixture of
true zeros and false zeros. Note that the false zeros can
also arise either due to chance, false recording and/or
due to false observation. It has been reported that some
of the involved (positive) nodes may be recorded as
negative due to misclassification by the pathologist (re-
ferred to as reporting error) [7]. One study reported that
non-dissection of complete axillary lymph nodes might
provide false negative nodes [8]. These false negative
nodes may be more likely to be found among patients
with a high risk of nodal involvement. This indicates a
need to estimate false negative nodes so that such patients
can be followed up or reassessed for diagnostic accuracy.
In these situations, we suggest use of the zero inflated
models, not only to account for excess zeros, but also to
estimate the proportion of false zeros or patients with
zeros at high risk of nodal positivity.
Significant applications of zero hurdle and zero in-
flated models have been made in various fields of re-
search [9-11]. In recent years, the application of these
models and their comparisons with other count models
has also increased in medical and health fields [12-19]. A
review of the application of such models in health re-
search is also reported [20]. Extensions of these models
for describing correlated data have also been reported
[21-24]. These studies illustrate that zero hurdle/inflated
models should be used if over-dispersion in the data is
due to excess zeros. Results also indicate that zero hur-
dle models should be preferred if only at-risk zeros are
present in the population. However, to our knowledge,
the relative performance of zero hurdle and inflated
models in predicting the number of involved nodes has
not been addressed. In this paper, prediction of the
number of involved nodes is made using Poisson regres-
sion (PR), negative binomial (NB), zero hurdle Poisson
(ZHP), zero inflated Poisson (ZIP), zero hurdle negative
binomial (ZHNB) and zero inflated negative binomial
(ZINB) models. Zero hurdle models in many epidemiologic
studies like the present one may account for
excess zeros satisfactorily, perhaps even as well as zero
inflated models. We argue and demonstrate that the zero
inflated models have an added advantage over the former
in describing the event of interest in relation to the
disease process itself, including identification of the
factors involved in predicting disease onset and disease
progression.
2. MATERIALS AND METHODS
2.1. Subjects
We utilized one of the largest breast cancer datasets
available in India to assess the number of involved nodes
distribution. The data were extracted from the comput-
erized database of breast cancer patients maintained at
the Department of Surgical Oncology, Institute Rotary
Cancer Hospital (IRCH), All India Institute of Medical
Sciences (AIIMS), New Delhi, India, a tertiary care cen-
ter, during the period from January 1993 to January 2005.
The dataset was updated using the original records kept
in the record section of IRCH. Data from all patients
who underwent surgery for breast cancer, including axil-
lary lymph node dissections, were included in this study.
Patients with recurrent breast cancer, bilateral breast
carcinoma, any evidence of metastasis, unknown primary
site and male breast carcinoma were excluded from the
study.
Covariates and their forms were chosen based on
breast cancer literature and an exploratory analysis of
this dataset. Patients’ age at presentation was stratified as
younger (below 35 years) and older (more than or equal
to 35 years). Duration from onset of symptoms until
presentation was classified as less than or equal to 2, 2-4,
4-8 and more than 8 months. Parity was categorized as
nulliparous, single/doubleparous, and multiparous. Other
covariates included menopausal status (post/pre); family
history of breast cancer (absent/present); primary side
(left/right); skin changes (no/yes); neoadjuvant chemo-
therapy (no/yes); primary site {medial (lower inner
quadrant and upper inner quadrant)/lateral (lower outer
quadrant and upper outer quadrant)/central (multiple,
central and others)}; tumor type (infiltrating ductal
carcinoma/infiltrating lobular carcinoma and others); and
pathological tumor size according to the TNM classification
(≤ 2/2-5/> 5 cm). Neoadjuvant chemotherapy
and total number of dissected nodes were only used
in the model for adjustment, because these variables are
highly associated with involved nodes. The study popu-
lation consisted of all cases of breast cancer and the
outcome in question was the number of involved nodes
in a patient. Patients with negative nodes (zeros) were
divided into two groups: those “at low risk” of
nodal involvement and those “at high risk” of nodal
involvement. A patient with negative nodes and having a
relatively low risk of nodal involvement was defined as
“at low risk” zero and labeled, in the context of model-
ing, as a “true or structural” zero. The remaining patients
with negative nodes and a relatively high risk of nodal
involvement due to the presence of various risk factors
were defined as “at high risk” zeros. In the context of
modeling, we label them as “false or sampling” zeros.
2.2. Statistical Models
The Poisson regression model (PR) describes count out-
comes or proportion/rates. Generally, the PR model ex-
plains less variability of counts than the observed vari-
ability. As a result, this often gives misleading relation-
ships between covariates and outcomes. Excess variabil-
ity can be adjusted within the PR framework using infla-
tion approaches of standard errors of the regression co-
efficients [25]. As such, it may be the appropriate model
to use for drawing correct inferences in the case of
over-dispersion due to unobserved heterogeneity and/or
clustering/temporal dependency. However, it may not be
the most appropriate in the case of excess zeros, as expected
in assessing the distribution of the number of involved
nodes. In the PR model, yi is the number of involved
nodes for the ith patient, and λi is the mean number
of involved nodes. If the number of involved nodes
follows a Poisson distribution, its probability mass function
can be expressed as:
$$ f(y_i \mid x_i) = \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots;\; i = 1, 2, \ldots, n;\; \lambda_i > 0 \tag{1} $$
If βi’s are regression coefficients corresponding to the set
of considered covariates xi’s, and k is the number of
considered covariates, then the PR model can be expressed
using Eq.1 as:
$$ \log(\lambda_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \tag{2} $$
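As a minimal sketch of Eq.1 and Eq.2, the following Python fragment evaluates the Poisson probability mass function for a mean obtained from the log-linear predictor. The coefficient and covariate values are hypothetical and are not estimates from this study; the function names are ours.

```python
import math

def poisson_pmf(y, lam):
    """Eq.1: probability of y involved nodes when the mean is lam (> 0)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_mean(betas, x):
    """Eq.2: log(lam) = beta_0 + beta_1*x_1 + ... + beta_k*x_k."""
    linear_predictor = betas[0] + sum(b * xj for b, xj in zip(betas[1:], x))
    return math.exp(linear_predictor)

# Hypothetical coefficients (intercept first) and covariate values.
betas = [0.5, 0.3, -0.2]
x = [1.0, 0.0]
lam = poisson_mean(betas, x)

# Probabilities for 0..59 nodes; the tail beyond that is negligible here.
probs = [poisson_pmf(y, lam) for y in range(60)]
mean_from_pmf = sum(y * p for y, p in enumerate(probs))
print(round(lam, 4), round(probs[0], 4), round(mean_from_pmf, 4))
```

Because the Poisson mean and variance are both equal to λi, any covariate pattern is forced to have a variance no larger than its mean, which is the restriction the later models relax.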
As an alternative to the PR model, the negative bino-
mial (NB) model has an inbuilt provision to account for
over-dispersion due to unobserved heterogeneity and/or
temporal dependency [26]. As a result, this model helps
not only in adjusting the standard errors of the regression
coefficients but also provides a more flexible approach
for prediction of the count outcome. Under the assump-
tion of over-dispersion being merely due to unobserved
heterogeneity and/or temporal dependency, the NB model
was used. The unobserved heterogeneity may be due to
unobserved predictors and/or too much variation in some
of the clinical and pathological cofactors. Temporal de-
pendency in nodes may be occurring due to clustering of
nodal involvement within patients. The NB model is
expressed as:
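$$ f(y_i \mid x_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}, \quad y_i = 0, 1, 2, \ldots;\; i = 1, 2, \ldots, n \tag{3} $$
In this model, α is the over-dispersion parameter due
to unobserved heterogeneity and λi is the mean number
of involved nodes. The NB regression model can be obtained
from Eq.3 in the same way as Eq.2.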
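The role of α in Eq.3 can be illustrated numerically. The sketch below assumes the parameterization of Eq.3, under which the variance is λ + αλ². The values λ = 3.9 and α = 1.73 are borrowed from the Results only for illustration: 3.9 is the unadjusted sample mean, while 1.73 was estimated with covariates in the model, so pairing them is our simplification.

```python
import math

def nb_pmf(y, lam, alpha):
    """Eq.3: negative binomial pmf with mean lam and over-dispersion alpha (> 0)."""
    r = 1.0 / alpha  # alpha^{-1}
    log_coef = math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
    log_prob = log_coef + r * math.log(r / (r + lam)) + y * math.log(lam / (r + lam))
    return math.exp(log_prob)

# Illustrative values only (see the note above on where they come from).
lam, alpha = 3.9, 1.73

# Sum far enough into the tail that the truncation error is negligible.
probs = [nb_pmf(y, lam, alpha) for y in range(2000)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print("mean:", round(mean, 3))
print("variance:", round(var, 3), "vs lam + alpha*lam^2 =", round(lam + alpha * lam ** 2, 3))
print("P(0) under NB:", round(probs[0], 4), "under Poisson:", round(math.exp(-lam), 4))
```

For the same mean, the NB distribution places much more probability on zero than the Poisson distribution, yet it does so only through α, which is why a separate zero component may still be needed.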
The NB model may not be appropriate if the over-
dispersion is due to excess zeros because it underesti-
mates the probability of zeros and consequently underes-
timates the variability present in the outcome. In such
situations, alternative models such as zero inflated/hur-
dle models that account for over-dispersion due to ex-
cess zeros are useful.
Zero hurdle models are typically used when the excess
zeros arise from an “at risk” population. Under the assumption
that over-dispersion results from excess zeros
arising from an “at risk” group, the zero hurdle Poisson
(ZHP) model was used. In this model, all zeros are considered
to be observed from a non-counting process, as opposed
to a counting process. Within this model, all zeros are
typically described through logistic regression, whereas
positive counts are described through a zero truncated
Poisson model. In the ZHP model, pi is the probability of “at risk”
negative nodes under the logistic model. Assuming a mean number
of involved nodes (λi) under the zero truncated Poisson
model, the ZHP distribution may be expressed [27] as:
$$ f(y_i \mid x_i) = \begin{cases} p_i, & y_i = 0 \\ (1 - p_i)\, \dfrac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!\,\bigl(1 - \exp(-\lambda_i)\bigr)}, & y_i \ge 1 \end{cases} \qquad 0 \le p_i \le 1;\; \lambda_i > 0;\; i = 1, 2, \ldots, n \tag{4} $$
If γi’s and βi’s are respective regression coefficients
under logistic and zero truncated Poisson models corresponding
to considered covariates (xi’s), and the number
of considered covariates is k in each of the models, then
using Eq.4 regression models can be expressed as:
$$ \begin{aligned} \log\!\left(\frac{p_i}{1 - p_i}\right) &= \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_k x_k \\ \log(\lambda_i) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \end{aligned} \tag{5} $$
The ZHP model provides two sets of results. These
results can also be obtained separately by fitting both a
logistic regression and a zero truncated Poisson model.
This is why hurdle models are referred to as two-part
models. The binary process model identifies factors associated
with the presence/absence of nodal involvement,
whereas modeling the count process yields factors associated
with an increase in the number of involved nodes
given that the patient has involved nodes. Note that the
ZHP model accounts for over-dispersion due to excess
zeros but not due to unobserved heterogeneity and/or
temporal dependency in nodal involvement. In the latter
case, one may use the zero hurdle negative binomial
(ZHNB) model by considering the count process as a zero
truncated negative binomial distribution. Substituting a
zero truncated negative binomial distribution in Eq.4
yields the ZHNB distribution, and it can be expressed as
Eq.6.
$$ f(y_i \mid x_i) = \begin{cases} p_i, & y_i = 0 \\ (1 - p_i)\, \dfrac{\dfrac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\dfrac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}}{1 - \left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha}}, & y_i \ge 1 \end{cases} \tag{6} $$
Zero inflated models are typically used when the excess
zeros are a mixture of two types of zeros: true
(structural zeros) and false (sampling zeros). We propose
to categorize the negative nodes in our population as a
mixture of two types, those with very low/no risk of
nodal involvement (true zeros) and those with high risk
of nodal involvement (false zeros). In this way, use of
the zero inflated model framework not only accounts for
the extra variability due to excess zeros but also estimates
the relative proportion of these “low risk” and
“high risk” zeros. Further, this can be used to identify
subjects with a high likelihood of being in one or the
other type of zero classification using the risk factors. In
zero inflated models, the occurrence of zeros is considered
to be the result of two distinct processes. Some of the zeros
(zeros at “high risk”) are considered to be observed from
the counting process and others (zeros at “low risk”) from
the non-counting process. As an inbuilt mechanism within
these models, true zeros are typically described through
logistic regression, whereas false zeros are described
through a simple count model. Like hurdle models, the
zero inflated models also provide two sets of results.
However, the interpretation of regression coefficients
under inflated models is different from the hurdle models.
Modeling the binary process provides factors associated
with negative nodes in a “low risk” population as compared
to a “high risk” population, whereas modeling the
count process provides factors associated with the extent
of the number of involved nodes, including false negative
nodes, given that patients are in a high risk population.
Here, the probability of observing negative nodes is
the probability of observing negative nodes (true) under the
logistic model plus the product of the probability that an
individual is not in the binary process and the probability of
negative nodes (false) under the considered count model. If
the count process follows the Poisson distribution then it
is called a zero inflated Poisson (ZIP) model. To understand
the ZIP model, consider the occurrence of “low
risk” negative nodes with probability pi under a logistic
model, and that of involved nodes (including
“high risk” false negative nodes) with probability (1-pi)
under the Poisson model having a mean number of involved
nodes (λi). The ZIP distribution can then be expressed
[28] as:
$$ f(y_i \mid x_i) = \begin{cases} p_i + (1 - p_i)\exp(-\lambda_i), & y_i = 0 \\ (1 - p_i)\, \dfrac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{\Gamma(y_i + 1)}, & y_i \ge 1 \end{cases} \qquad 0 \le p_i \le 1;\; \lambda_i > 0 \tag{7} $$
If γi’s and βi’s are respective regression coefficients
under logistic and Poisson models corresponding to considered
covariates (xi’s), and the number of considered
covariates is k in each of the models, then using Eq.7,
regression models can be expressed as:
$$ \begin{aligned} \log\!\left(\frac{p_i}{1 - p_i}\right) &= \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \cdots + \gamma_k x_k \\ \log(\lambda_i) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \end{aligned} \tag{8} $$
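The difference between the hurdle and zero inflated formulations is easiest to see by evaluating Eq.4 and Eq.7 side by side. The following sketch uses hypothetical values of pi and λi for a single patient; they are not estimates from this study.

```python
import math

def poisson_pmf(y, lam):
    """Ordinary Poisson pmf, as in Eq.1."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def zhp_pmf(y, p, lam):
    """Eq.4: zero hurdle Poisson; every zero comes from the binary part."""
    if y == 0:
        return p
    return (1.0 - p) * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))

def zip_pmf(y, p, lam):
    """Eq.7: zero inflated Poisson; zeros come from both parts."""
    if y == 0:
        return p + (1.0 - p) * math.exp(-lam)
    return (1.0 - p) * poisson_pmf(y, lam)

# Hypothetical parameter values for one patient.
p, lam = 0.3, 2.0

zhp_total = sum(zhp_pmf(y, p, lam) for y in range(60))
zip_total = sum(zip_pmf(y, p, lam) for y in range(60))

print("ZHP P(0):", zhp_pmf(0, p, lam))
print("ZIP P(0):", round(zip_pmf(0, p, lam), 4))
print("totals:", round(zhp_total, 6), round(zip_total, 6))
```

With the same pi, the ZIP model assigns a larger probability to zero than the ZHP model because its count part contributes zeros as well. The two pi’s therefore answer different questions: the probability of any zero under the hurdle model, and the probability of a “low risk” zero under the zero inflated model.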
If the count process does not follow the Poisson
model then one may use the zero inflated negative bi-
nomial (ZINB) model by considering count process as a
negative binomial distribution. In contrast to ZIP, the
ZINB model accounts for the over-dispersion due to
both types of zeros as well as due to unobserved heterogeneity
and/or temporal dependency. Substituting the negative
binomial distribution in Eq.7 yields the ZINB distribution,
which can be expressed as Eq.9.
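One use of the ZINB distribution is to split the expected zeros for a patient into a “low risk” (logistic) share and a “high risk” (count process) share. The sketch below assumes the standard ZINB zero probability, pi + (1-pi)(α⁻¹/(α⁻¹+λi))^(1/α), which follows from substituting Eq.3 at yi = 0 into the zero line of Eq.7. The values of pi and λi are hypothetical; α = 0.71 is the ZINB estimate reported in the Results.

```python
import math

def nb_zero_prob(lam, alpha):
    """Probability of a zero count under the NB part (Eq.3 at y = 0)."""
    r = 1.0 / alpha
    return (r / (r + lam)) ** r

def zinb_zero_prob(p, lam, alpha):
    """ZINB probability of zero: structural zeros plus sampling zeros."""
    return p + (1.0 - p) * nb_zero_prob(lam, alpha)

def structural_zero_share(p, lam, alpha):
    """Share of expected zeros that come from the logistic ('low risk') part."""
    return p / zinb_zero_prob(p, lam, alpha)

# p and lam are hypothetical; alpha = 0.71 is the reported ZINB estimate.
p, lam, alpha = 0.25, 5.0, 0.71

share_low_risk = structural_zero_share(p, lam, alpha)
share_high_risk = 1.0 - share_low_risk
print("P(0):", round(zinb_zero_prob(p, lam, alpha), 4))
print("low risk share:", round(share_low_risk, 4))
print("high risk share:", round(share_high_risk, 4))
```

Averaging such shares over patients is one way a population-level split between “low risk” and “high risk” zeros could be obtained; whether the authors computed their split this way is not stated in the text.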
2.3. Model Comparisons
The PR, NB, ZIP, ZHP, ZHNB and ZINB models were
used to describe the number of involved nodes in breast
cancer patients. The covariates found to be significant in
univariate analysis with any of the regressions were included
in all the regression models so that the findings
remain comparable. The nested models (e.g., PR versus
NB and ZIP, NB versus ZINB, and ZHP versus
ZHNB) were compared using a likelihood ratio test. A
significant result of the likelihood ratio test (PR
versus NB, NB versus ZINB, and ZHP versus ZHNB)
indicates the presence of over-dispersion due to heterogeneity
and/or temporal dependency. The non-nested
models (PR with ZHP, PR with ZHNB, PR with ZINB,
NB with ZHP, NB with ZIP, NB with ZHNB, ZHP with
ZIP, ZHP with ZINB and ZHNB with ZINB) as well as
nested models were also compared using the Vuong test
[29]. A significantly better fit in these comparisons (PR with
ZHP/ZIP, and NB with ZHNB/ZINB) indicates that
the over-dispersion is due to excess zeros.
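Both tests can be sketched in a few lines of Python. The likelihood ratio example uses the log likelihoods reported in Table 2 and assumes one extra parameter (α) in the larger model; because α = 0 lies on the boundary of the parameter space, the chi-square(1) p-value shown is conservative. The Vuong function is the unadjusted statistic computed from per-patient log likelihood contributions, and the four values passed to it are hypothetical, since per-patient contributions are not available from the paper.

```python
import math

def lr_test_df1(loglik_restricted, loglik_full):
    """Likelihood ratio statistic and its chi-square(1) p-value."""
    lr = 2.0 * (loglik_full - loglik_restricted)
    # Survival function of chi-square with 1 df: P(Z^2 > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value

def vuong_statistic(loglik_a, loglik_b):
    """Unadjusted Vuong z-statistic; positive values favor model A."""
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    return math.sqrt(n) * mean / sd

# Log likelihoods from Table 2.
lr_pr_nb, p_pr_nb = lr_test_df1(-4093.9, -2598.6)        # PR versus NB
lr_zip_zinb, p_zip_zinb = lr_test_df1(-3018.4, -2551.1)  # ZIP versus ZINB
print("PR vs NB:", round(lr_pr_nb, 1), p_pr_nb)
print("ZIP vs ZINB:", round(lr_zip_zinb, 1), p_zip_zinb)

# Hypothetical per-patient log likelihood contributions for two models.
v = vuong_statistic([-1.0, -2.0, -1.5, -0.5], [-1.5, -2.2, -2.5, -0.6])
print("Vuong z:", round(v, 3))
```

The Vuong statistic is compared with the standard normal distribution: large positive values favor the first model, large negative values favor the second, and values near zero indicate that neither model is preferred.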
To compare the predictive performance of the models,
various indices such as log likelihood, Akaike Informa-
tion Criterion (AIC), Bayesian Information Criterion
(BIC), mean squared prediction error (MSPE) and mean
absolute prediction error (MAPE) were also obtained. A
probability plot (observed probability minus predicted
probability of positive nodes versus number of positive
nodes) was constructed for each model. The probability
plot was constructed after truncation at 10 positive nodes
for ease of visual comparison. The best-fitted model was
also validated using the leave-one-out cross validation
method [30]. P-values less than 0.05 were considered
statistically significant. The STATA 9.0 package was used for
all statistical analyses.
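For reference, AIC and BIC follow directly from the log likelihood. The sketch below reproduces the PR values reported in Table 2 under the assumption that the PR model has k = 17 parameters: an intercept, the 14 dummy terms listed in Table 1, and the two adjustment variables. The parameter count is our inference, not a figure stated by the authors.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: -2*logL + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion: -2*logL + k*log(n)."""
    return -2.0 * loglik + k * math.log(n)

n_patients = 1152     # sample size reported in the Results
loglik_pr = -4093.9   # PR log likelihood from Table 2
k_pr = 17             # assumed number of PR parameters (see the note above)

print("AIC:", round(aic(loglik_pr, k_pr), 1))
print("BIC:", round(bic(loglik_pr, k_pr, n_patients), 1))
```

Since log(1152) is about 7.05, BIC charges roughly 3.5 times as much per parameter as AIC, so the two criteria can rank a one-part model and a two-part model with about twice as many parameters differently.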
3. RESULTS
A total of 1152 patients were found to be eligible for this
study. Of those in the study, the presence of involved
nodes was found in 705 (61.2%) patients. The mean and
standard deviation of the number of involved nodes per
patient were 3.9 and 5.6 respectively (median 1 and
range: 0-33). Median number of total dissected nodes
per patient was 14 (range: 1-46). The mean age was 47.7
(standard deviation, 11.1) years and range 20-86 years.
The distributions of covariates considered in the analysis
are shown in Table 1.
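These summary figures already point to over-dispersion and excess zeros. As a rough check, assuming an intercept-only Poisson model with the sample mean as its rate (our simplification, ignoring covariates), the observed share of node-negative patients can be compared with the share such a model would expect:

```python
import math

n_patients = 1152
n_positive = 705   # patients with at least one involved node
mean_nodes = 3.9
sd_nodes = 5.6

# Observed share of node-negative patients (zeros).
observed_zero_share = (n_patients - n_positive) / n_patients

# Share of zeros expected under a Poisson distribution with the same mean.
poisson_zero_share = math.exp(-mean_nodes)

# A Poisson distribution has variance equal to its mean (ratio of 1).
variance_to_mean = sd_nodes ** 2 / mean_nodes

print("observed zeros:", round(observed_zero_share, 3))
print("Poisson-expected zeros:", round(poisson_zero_share, 3))
print("variance/mean:", round(variance_to_mean, 2))
```

The observed share of zeros (about 39%) is far above what a Poisson model with this mean would predict (about 2%), and the variance is roughly eight times the mean.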
A descriptive comparison reveals that the cofactors
parity, skin changes, primary site and pathological tumor
size were consistently associated with outcome across all
models. Three additional covariates, age, menopausal
status and tumor type, were statistically significant only
in the PR model. There was good concordance in the
assessment of statistical significance in all aspects among
ZHP, ZIP and NB models. A similar relation could also
be seen between the ZINB and ZHNB models in pro-
viding factors associated with the extent of nodal in-
volvement. In other words, parity, skin changes, primary
site and tumor size were found associated with a greater
number of involved nodes in both models. However, the
ZHNB model identified primary site, skin changes and
pathological tumor size as associated with the presence of
positive nodes, whereas the ZINB model identified only
primary site and skin changes as associated with the presence of
positive nodes in the high-risk population.
$$ f(y_i \mid x_i) = \begin{cases} p_i + (1 - p_i) \left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha}, & y_i = 0 \\ (1 - p_i)\, \dfrac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\dfrac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{1/\alpha} \left(\dfrac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}, & y_i \ge 1 \end{cases} \tag{9} $$
Table 1. Zero inflated negative binomial model for number of involved nodes.

Variables                   N      Logistic Portion*      NB Portion
                                   Odds Ratio (95% CI)    Risk Ratio (95% CI)
Age (year)
  > 35                      977    1.00                   1.00
  <= 35                     175    0.98 (0.54, 1.80)      1.12 (0.90, 1.38)
Symptom duration (month)
  <= 2                      376    1.00                   1.00
  3-4                       263    0.74 (0.43, 1.26)      1.00 (0.82, 1.23)
  5-8                       266    1.13 (0.71, 1.81)      1.17 (0.95, 1.43)
  >= 9                      247    0.73 (0.43, 1.24)      1.08 (0.88, 1.33)
Parity
  Nulliparous               47     1.00                   1.00
  P1/P2                     445    1.18 (0.26, 5.31)      1.82 (1.20, 2.77)
  Multiparous               660    1.67 (0.38, 7.44)      1.95 (1.29, 2.95)
Menopausal
  Post Menopausal           587    1.00                   1.00
  Pre Menopausal            565    0.69 (0.45, 1.04)      1.01 (0.85, 1.18)
Primary side
  Left                      583    1.00                   1.00
  Right                     569    0.87 (0.60, 1.26)      0.91 (0.79, 1.06)
Primary site
  Medial (UIQ + LIQ)        235    1.00                   1.00
  Lateral (LOQ + UOQ)       681    0.62 (0.40, 0.96)      1.29 (1.05, 1.60)
  Central/Multiple/Other    236    0.38 (0.19, 0.74)      1.24 (0.97, 1.58)
Skin changes
  No                        746    1.00                   1.00
  Yes                       406    0.38 (0.23, 0.62)      1.40 (1.19, 1.66)
Tumor type
  Other/ILC                 78     1.00                   1.00
  IDC                       1074   0.62 (0.31, 1.22)      1.14 (0.82, 1.57)
Tumor size (centimeter)
  <= 2                      236    1.00                   1.00
  2-5                       666    0.63 (0.40, 1.01)      1.28 (1.03, 1.59)
  > 5                       250    0.61 (0.34, 1.09)      1.49 (1.17, 1.91)

*The odds ratio of negative nodes in low risk group
All the results are adjusted in relation to neoadjuvant chemotherapy as well as total number of dissected nodes

The significant Pearson chi square goodness of fit (gof)
test (p < 0.001) along with other characteristics of model
fit indicated that the PR model produced a poor fit for
nodal involvement data. In the NB model, the estimated
dispersion statistic (α) was 1.73 (95% CI: 1.54, 1.95). A
significant likelihood ratio test (p < 0.001) of the dispersion
statistic from zero favored the NB model over the PR
model. Recall that more than one third of the patients
had negative nodes, indicating an excess of negative
nodes. Intuitively, this suggests that over-dispersion is
most likely due to excess negative nodes. Firstly, all
negative nodes were considered to arise from an at-risk
group, justifying use of the ZHP model. Further, to esti-
mate false negative nodes, it was considered that some
of these negative nodes might be observed among pa-
tients who had a “low risk” of nodal positivity (true ze-
ros) and some proportion might be observed among pa-
tients who had “high risk” of nodal involvement (false
zeros). With this more natural consideration, the ZIP
model was used. Both the Vuong test (V = 12.60, p <
0.001) and the significant likelihood ratio test favored
the ZHP model over the PR model. However, the comparison
of ZHP and ZIP using the Vuong test (V = 2.01,
p = 0.04) slightly favored the ZIP model. The results of
Vuong tests also favored the NB model over the ZHP
model (V = 8.86, p < 0.001) and the ZIP model (V = 8.84, p <
0.001). The improved fit of the NB
model over the PR and ZHP/ZIP models clearly indicates
that over-dispersion arises from unobserved heterogeneity
and/or clustering. In addition, the ZHP/ZIP models provided
evidence of over-dispersion due to excess negative
nodes, in comparison to the PR model. Hence, a model
incorporating over-dispersion due to excess negative
nodes as well as unobserved heterogeneity simultane-
ously was expected to provide improved predictability of
number of involved nodes. Accordingly, ZHNB and
ZINB models were used to predict number of involved
nodes. Under the ZHNB and ZINB models, the estimated
dispersion parameters of the zero truncated negative binomial
and NB components differed from zero
[α = 0.70 (95% CI: 0.56, 0.87) and α = 0.71 (95%
CI: 0.57, 0.89), respectively]. This suggests that ZHNB/
ZINB models are more appropriate than ZHP/ZIP models
in describing the number of involved nodes. The better
fit of ZHNB/ZINB models over the NB model suggests
that over-dispersion is not only due to unobserved
heterogeneity and/or clustering but also due to excessive
negative nodes. The result of the Vuong test showed
no difference between ZHNB and ZINB models in predicting
nodal frequency (V = 1.53, p = 0.13).
The model fit characteristics are shown in Table 2.
The minimum BIC was observed for the NB model, fol-
lowed by ZHNB/ZINB models. However, other validity
indices of the model (maximum log likelihood, mini-
mum AIC, MSPE and MAPE) favored ZHNB/ZINB
models over all other models. The plot of observed mi-
nus predicted probability of involved nodes at each
count is shown in Figure 1. The PR model underesti-
mates probability of occurrence of negative node and
overestimates occurrence of one positive node. The line
of difference between observed minus predicted prob-
ability of positive nodes was close to the reference zero
line, showing better fit of ZHNB/ZINB models than the
other models. There is virtually no difference between
ZHNB and ZINB models in all aspects of describing the
number of involved nodes. The ZINB model provides
slightly smaller validity indices as compared to ZHNB.
Finally, the ZINB model was assessed by the leave one
out cross validation method. The MSPE in cross valida-
tion of the ZINB model was the lowest of all the models
(0.0007), indicating that the ZINB model performs well
for predicting nodal involvement in future patients. The
ZINB model predicts that 70.6% of all patients with negative
nodes are “low risk” zeros, and the remaining 29.4% are
“high risk” zeros. This indicates that almost
30% of the patients observed as negative for nodal involvement
are at “high risk” of nodal involvement based
on cofactors.
Table 1 displays the estimates of regression coeffi-
cients for various cofactors of both portions of the ZINB
model. For ZINB, the results of both parts of the models
together help in understanding the role of the factors on
nodal distribution. The logistic portion showed that me-
dial primary site and absence of skin changes signifi-
cantly increased the chance of negative nodes in breast
cancer patients. The negative binomial portion reveals that
the expected number of involved nodes was 82
percent higher in single/doubleparous patients than in
nulliparous patients, given that the patients are in a high-risk
group. Further, it was 95 percent higher among
multiparous patients. Patients with lateral site involvement
had 1.29 times as many positive nodes as patients with the
medial site. Women with skin changes had 1.39 times
as many positive nodes as
their counterparts. The expected number of positive
nodes was 28 percent higher among patients with 2-5 cm
tumor size than among patients with tumor size of 2 cm or
less, and 1.49 times as high among
patients with more than 5 cm tumor size as compared to
2 cm or less.
Figure 1. Plots of observed minus predicted probability of
positive nodes versus number of positive nodes for six models.
Table 2. Comparison of model fit characteristics.

                  PR         NB         ZHP        ZIP        ZHNB       ZINB
Log Likelihood    -4093.9    -2598.6    -3019.7    -3018.4    -2553.7    -2551.1
AIC               8221.8     5233.1     6107.4     6104.8     5185.4     5172.2
BIC               8307.6     5324.0     6279.0     6276.5     5382.3     5348.9
MSPE              4764.0     139.1      632.5      627.62     52.9       49.2
MAPE              27.5       6.2        13.1       13.0       4.8        4.7

4. DISCUSSION
The number of involved nodes is one of the most impor-
tant therapeutic and prognostic factors for breast cancer
[1]. Clinicians need to predict the number of involved
nodes in breast cancer patients in order to improve health
outcomes. To the best of our knowledge, few studies
have described the number of involved nodes in breast
cancer patients and tested statistical models to accurately
predict the number of involved nodes. As with most
count data, these studies found more variability in the nodal
distribution than expected under a Poisson model. They
also generally assumed the cause of over-dispersion to be
solely unobserved heterogeneity, and therefore
used the NB model to fit and describe nodal frequency
[3,4]. However, data on nodal involvement often involve
excess zeros, which also cause over-dispersion.
This indicates a need to explore fitting zero hurdle and
zero inflated models, which can also account for vari-
ability due to excessive zeros. In the current paper, we
fitted various count models to identify putative causes of
over-dispersion, and to assess the predictive performance
of these models with regard to the nodal status in a
population of patients with breast cancer. We also illus-
trated the significance of using zero inflated models in
count data involving zeros that emanate from the sub-
jects that are all “at-risk” of the event of interest.
The ZHNB/ZINB regression models provide the best
fit when predicting the number of involved nodes in
breast cancer patients. This confirms that the distribution
of the involved nodes contained over-dispersion not only
due to unobserved heterogeneity but also due to exces-
sive negative nodes (zeros). As expected, the PR model
had the worst prediction ability for nodal frequency.
Accounting for only one source of over-dispersion, either
excessive zeros or unobserved heterogeneity, improved the
prediction of nodal frequency, as indicated by the NB, ZHP,
and ZIP models. However, the ZHNB/ZINB models, which allow
for more than one source of over-dispersion, provided the
smallest prediction error.
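The ordering of the six models can be checked directly against the fit statistics in Table 2. The sketch below copies the tabulated values and ranks the models on each criterion, where lower is better. The dictionary layout and the helper name `rank_models` are illustrative choices, not part of the original analysis.

```python
# Fit statistics copied from Table 2 (lower is better for all four).
table2 = {
    "PR":   {"AIC": 8221.8, "BIC": 8307.6, "MSPE": 4764.0, "MAPE": 27.5},
    "NB":   {"AIC": 5233.1, "BIC": 5324.0, "MSPE": 139.1,  "MAPE": 6.2},
    "ZHP":  {"AIC": 6107.4, "BIC": 6279.0, "MSPE": 632.5,  "MAPE": 13.1},
    "ZIP":  {"AIC": 6104.8, "BIC": 6276.5, "MSPE": 627.62, "MAPE": 13.0},
    "ZHNB": {"AIC": 5185.4, "BIC": 5382.3, "MSPE": 52.9,   "MAPE": 4.8},
    "ZINB": {"AIC": 5172.2, "BIC": 5348.9, "MSPE": 49.2,   "MAPE": 4.7},
}


def rank_models(table, criterion):
    """Return model names ordered from best (lowest) to worst."""
    return sorted(table, key=lambda name: table[name][criterion])


criteria = ("AIC", "BIC", "MSPE", "MAPE")
best = {c: rank_models(table2, c)[0] for c in criteria}

for c in criteria:
    print(c, "->", ", ".join(rank_models(table2, c)))
```

On these values ZINB ranks first on AIC, MSPE and MAPE, with ZHNB a close second, and PR ranks last on every criterion. BIC, which penalizes additional parameters more heavily, is lowest for the NB model; this is a property of the tabulated numbers rather than a new result.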
The ZHNB and ZINB models were consistent and
similar for factor-identification in the extent of nodal
involvement as well as for prediction of number of posi-
tive (involved) nodes. In the current study, we focused
on predicting nodal frequency. On that basis, either model
can be used to predict the number of involved nodes. Because
the results of the ZHNB model are easier to interpret, it may
be preferred over the ZINB model. These findings are
supported by Rose et al. [6], who also found good concordance
between the ZHNB and ZINB models on vaccine adverse event
data, a case of only “at risk” zeros similar to
the data used in our study. They suggested that the model
selection should be determined based on study objec-
tives and the data generating process. They recommend
using the ZHNB model due to involvement of only “at
risk” zeros. However, Baughman [31] suggested that
model choice should be based on the rationale behind
the consideration of data generating mechanism. Gilt-
horpe et al. [32] suggested that the zero inflated models
should be used according to the underlying disease
process i.e., considerations of disease onset and disease
progression. In our opinion, zero hurdle models should
be preferred if all zeros in the data come from subjects
at “no-risk” of the outcome of interest, and
over-dispersion is due to excess zeros. In such cases,
zeros from the “no-risk” population arise from a
non-counting process. However, zeros from an “at risk”
population belong to the count process, so model choice
should rest on the rationale behind the data generation
in the “at risk” population. In the present
study, if diagnosis is close to or at disease onset, the risk
of finding the event of interest (nodal involvement)
would be minimal, whereas if the diagnosis is late and
during disease progression, the risk of the event of inter-
est would be relatively high. Previous studies note that
the distribution of involved nodes often consists of some
proportion of false negative nodes, which may often
arise in the “high-risk” group [7,8]. There is ample
evidence to consider “at risk” zeros, at least in breast cancer,
as a mixture of “low-risk” and “high-risk” zeros, which
suggests the use of zero inflated models. The ZINB model
not only gives an estimate of the false negative nodes,
i.e., zeros at “high risk” of nodal involvement,
but also provides slightly better predictive performance
than the ZHNB model.
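The difference between the two model families discussed above can be made explicit with their probability functions. The block below uses a standard parameterization that is assumed here for illustration: $\mu$ is the negative binomial mean, $\alpha$ the dispersion parameter, $\pi$ the zero-inflation probability and $\pi_0$ the hurdle probability of a zero. The notation may differ from that used in the methods section of the paper.

```latex
% Negative binomial probability function (mean \mu, dispersion \alpha)
f_{NB}(y;\mu,\alpha) =
  \frac{\Gamma(y + 1/\alpha)}{\Gamma(1/\alpha)\, y!}
  \left(\frac{1}{1+\alpha\mu}\right)^{1/\alpha}
  \left(\frac{\alpha\mu}{1+\alpha\mu}\right)^{y},
  \qquad y = 0, 1, 2, \dots

% Zero inflated NB: zeros are a mixture of two sources
P(Y = 0) = \pi + (1-\pi)\, f_{NB}(0;\mu,\alpha), \qquad
P(Y = y) = (1-\pi)\, f_{NB}(y;\mu,\alpha), \quad y \ge 1

% Zero hurdle NB: all zeros come from the binary part
P(Y = 0) = \pi_0, \qquad
P(Y = y) = (1-\pi_0)\,
  \frac{f_{NB}(y;\mu,\alpha)}{1 - f_{NB}(0;\mu,\alpha)}, \quad y \ge 1
```

In the zero inflated form, the term $(1-\pi)\, f_{NB}(0;\mu,\alpha)$ corresponds to the zeros that belong to the count process, which is what the text calls “high-risk” zeros. The hurdle form has no such term, so it cannot separate the two kinds of zeros.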
The ZINB model estimated that about 30 percent of the
zeros can be considered false or “high risk” negative
nodes, suggesting that these patients are at high risk of
nodal involvement. Among these, some patients might
have been observed or reported falsely as having nega-
tive nodes. If so, then those patients might have been
under-treated and/or misclassified, resulting in an
inaccurate predicted prognosis. This model will help to
identify such patients and reduce misclassification. A sound
strategy is needed to classify patients with zeros into
“high risk” and “low risk” groups. We are investigating
this issue, which will be the subject of a future
publication.
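One possible starting point for such a classification is the probability, under a fitted zero inflated negative binomial model, that an observed zero comes from the count process rather than from the zero-inflation component. The sketch below assumes the standard mixture form, with `pi` the zero-inflation probability, `mu` the negative binomial mean and `alpha` the dispersion. The function names and the numeric inputs are hypothetical and are not estimates from this study.

```python
def nb_zero_prob(mu, alpha):
    """P(Y = 0) under a negative binomial with mean mu and dispersion alpha."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)


def zero_components(pi, mu, alpha):
    """Split P(Y = 0) into the inflation part and the count-process part."""
    low_risk = pi                                     # zero-inflation component
    high_risk = (1.0 - pi) * nb_zero_prob(mu, alpha)  # zero from the count process
    return low_risk, high_risk


def prob_high_risk_given_zero(pi, mu, alpha):
    """Probability that an observed zero belongs to the count process."""
    low_risk, high_risk = zero_components(pi, mu, alpha)
    return high_risk / (low_risk + high_risk)


# Hypothetical values for one patient, chosen only for illustration.
pi, mu, alpha = 0.30, 4.0, 2.0
print(f"P(zero)              = {sum(zero_components(pi, mu, alpha)):.4f}")
print(f"P(high risk | zero)  = {prob_high_risk_given_zero(pi, mu, alpha):.4f}")
```

Averaging this probability over the node-negative patients would give a model-based share of “high risk” zeros. Any cut-off used to label an individual patient would still need clinical justification.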
The mean square prediction error of the ZINB model was
35.4% of that of the NB regression model (49.2 versus 139.1
in Table 2), i.e., about 64.6% lower. In addition, the predictive performance of
the ZINB model was significantly better than the NB
regression model, indicating that the NB model may not
always be appropriate for describing nodal distribution.
The leave-one-out cross-validation assessment of the
developed ZINB model provided the minimum mean
square prediction error compared to the other developed
models, indicating that the model performs well, even
for future patients, in comparison to other models.
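The two quantities in this paragraph can be illustrated with a short sketch. The first part recomputes the ratio of the ZINB to the NB mean square prediction error from the Table 2 values. The second part shows the general shape of leave-one-out cross-validation. The helper names, the toy counts and the intercept-only "model" are assumptions made for illustration, and the per-observation squared error used here may not match the MSPE definition used in the paper.

```python
def relative_mspe(mspe_model, mspe_reference):
    """MSPE of a model expressed as a fraction of a reference model's MSPE."""
    return mspe_model / mspe_reference


# Table 2 values: ZINB = 49.2, NB = 139.1.
ratio = relative_mspe(49.2, 139.1)
print(f"ZINB MSPE is {ratio:.1%} of the NB MSPE "
      f"({1 - ratio:.1%} lower)")


def loocv_mspe(y, fit, predict):
    """Leave-one-out mean square prediction error.

    fit(train) returns a fitted model from the training outcomes;
    predict(model) returns the prediction for the held-out observation.
    """
    squared_errors = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]          # drop observation i
        model = fit(train)                 # refit without it
        squared_errors.append((y[i] - predict(model)) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy stand-in for a regression model: predict the training mean.
def fit_mean(train):
    return sum(train) / len(train)


def predict_mean(model):
    return model


toy_counts = [0, 0, 0, 1, 2, 5, 9]         # hypothetical node counts
print(f"LOOCV MSPE (toy) = {loocv_mspe(toy_counts, fit_mean, predict_mean):.3f}")
```

In a real analysis `fit` would refit the count regression model on the remaining patients and `predict` would use the held-out patient's covariates.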
This study is the first report to analyze patterns of
nodal involvement in breast cancer, using a large dataset
collected in India. In our study, 61.2% of the patients
had involved nodes. Sandhu et al., using a different
Indian dataset, reported a similar nodal involvement
of 61.6% [33]. A different study, also using a population
from India, reported an even higher nodal positivity
rate of 80.2% [34]. In our study, a primary site other
than medial and the presence of skin changes were both
associated with a high risk of nodal involvement
and with a greater number of involved nodes. In addition
to these two factors, higher parity and larger tumor size
were also associated with an increased risk of a higher
number of involved nodes, given that the patients are in
the high risk population. These factors are consistently found
to be associated with the presence of involved nodes in
other studies [35-41], and are directly or indirectly con-
sequences of late diagnosis. Overall, these findings con-
firm the need for ongoing efforts to minimize diagnostic
delay in patients suspected of having breast cancer.
One limitation of our study is that it uses a dataset not
designed for our analysis. Important covariates, such as
lymphatic vascular invasion and S-phase fraction, were
not included in this database. These covariates could be
significantly associated with involved nodes, as reported
in various studies [42-45]. In addition, instead of
adjusting these results for the number of dissected
nodes, an attempt could be made to model the proportion
of positive nodes in patients through count data
models or binomial models.
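The two alternatives mentioned in the last sentence can be written out as follows. This is a sketch under assumed notation, not the authors' specification: $Y_i$ is the number of positive nodes for patient $i$, $n_i$ the number of dissected nodes, $x_i$ the covariate vector, and $\beta$ and $\gamma$ are regression coefficients.

```latex
% Count model with the number of dissected nodes as an offset
\log E[Y_i] = \log n_i + x_i^{\top}\beta
\quad\Longrightarrow\quad
\frac{E[Y_i]}{n_i} = \exp\!\left(x_i^{\top}\beta\right)

% Binomial model for the proportion of positive nodes
Y_i \sim \mathrm{Binomial}(n_i, p_i), \qquad
\log\frac{p_i}{1-p_i} = x_i^{\top}\gamma
```

In the offset form the coefficients describe the expected number of positive nodes per dissected node, while the binomial form bounds the count by $n_i$ directly.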
5. CONCLUSIONS
The ZHNB/ZINB regression models can be used to de-
scribe nodal distribution more appropriately than the NB
model. However, the ability of the ZINB model to estimate
“high-risk” zeros while having a comparatively lower
prediction error than the ZHNB model suggests that it is
the best model for predicting and describing the number
of involved nodes. Many of the factors associated with
nodal involvement may be a result of diagnostic delay,
indicating the need to minimize delay in the diagnosis
of breast cancer patients. There is also a need to further
investigate the consequences of using zero inflated models,
as an alternative to zero hurdle models, in at-risk
populations.
6. ACKNOWLEDGEMENTS
The authors would like to express their thanks to Dr. V. Sreenivas,
Department of Biostatistics, All India Institute of Medical Sciences,
New Delhi; Dr. Arvind Pandey, National Institute of Medical Statistics,
New Delhi; and also Dr. Kishore Chaudhry and Dr. D. K. Shukla,
Division of Non-Communicable Diseases, Indian Council of Medical
Research, New Delhi, for their critical comments throughout this study.
REFERENCES
[1] Hernandez-Avila, C.A., Song, C., Kuo, L., Tennen, H.,
Armeli, S. and Kranzler, H.R. (2006) Targeted versus
daily naltrexone: Secondary analysis of effects on aver-
age daily drinking. Alcoholism, Clinical and Experimen-
tal Research, 30(5), 860-865.
[2] Slymen, D.J., Ayala, G.X., Arredondo, E.M. and Elder,
J.P. (2006) A demonstration of modeling count data with
an application to physical activity. Epidemiologic Per-
spectives & Innovations, 3(3), 1-9.
[3] Horton, N.J., Kim, E. and Saitz, R. (2007) A cautionary
note regarding count models of alcohol consumption in
randomized controlled trials. BioMed Central Medical
Research Methodology, 7(9), 1-9.
[4] Salinas-Rodriguez, A., Manrique-Espinoza, B. and Sosa-
Rubi, S.G. (2009) Statistical analysis for count data: Use
of health services applications. Salud Publica Mex, 51(5),
397-406.
[5] Asada, Y. and Kephart, G. (2007) Equity in health
services use and intensity of use in Canada. Biomed Central
Health Services Research, 7(41), 1-12.
[6] Grootendorst, P.V. (1995) A comparison of alternative
models of prescription drug utilization. Health Econom-
ics, 4(3), 183-198.
[7] Afifi, A.A., Kotlerman, J.B., Ettner, S.L. and Cowan, M.
(2007) Methods for improving regression analysis for
skewed continuous or counted responses. Annual Review
of Public Health, 28, 95-111.
[8] Hur, K., Hedeker, D., Henderson, W., Khuri, S. and
Daley, J. (2002) Modeling clustered count data with excess
zeros in health care outcomes research. Health Services
and Outcomes Research Methodology, 3, 5-20.
[9] Lee, A.H., Wang, K., Scott, J.A., Yau, K.K. and McLach-
lan, G.J. (2006) Multi-level zero-inflated Poisson regres-
sion modeling of correlated count data with excess zeros.
Statistical Methods in Medical Research, 15(1), 47-61.
[10] Yau, K.K. and Lee, A.H. (2001) Zero-inflated Poisson
regression with random effects to evaluate an occupational
injury prevention programme. Statistics in Medicine,
20(19), 2907-2920.
[11] Min, Y. and Agresti, A. (2005) Random effect models for
repeated measures of zero-inflated count data. Statistical
Modelling, 5(1), 1-19.
[12] Gardner, W., Mulvey, E.P. and Shaw, E.C. (1995) Re-
gression analyses of counts and rates: Poisson, overdis-
persed Poisson, and negative binomial models. Psycho-
logical Bulletin, 118(3), 392-404.
[13] Hardin, J.W. and Hilbe, J.M. (2007) Generalized Linear
Models and Extensions. A Stata Press Publication,
StataCorp LP, Texas.
[14] Mullahy, J. (1986) Specification and testing of some
modified count data models. Journal of Econometrics,
33(3), 341-365.
[15] Lambert, D. (1992) Zero-inflated Poisson regression,
with application to defects in manufacturing. Technomet-
rics, 34(1), 1-14.
[16] Vuong, Q.H. (1989) Likelihood ratio tests for model
selection and non-nested hypotheses. Econometrica,
57(2), 307-333.
[17] Picard, R. and Cook, D. (1984) Cross-Validation of Re-
gression Models. Journal of the American Statistical As-
sociation, 79(387), 575-583.
[18] Baughman, L.A. (2007) Mixture model framework fa-
cilitates understanding of zero-inflated and hurdle mod-
els for count data. Journal of Biopharmaceutical Statis-
tics, 17(5), 943-946.
[19] Gilthorpe, M.S., Frydenberg, M., Cheng, Y. and Baelum,
V. (2009) Modelling count data with excessive zeros: The
need for class prediction in zero-inflated models and the
issue of data generation in choosing between zero-in-
flated and generic mixture models for dental caries data.
Statistics in Medicine, 28(28), 3539-3553.
[20] Sandhu, D.S., Sandhu, S., Karwasra, R.K. and Marwah,
S. (2010) Profile of breast cancer patients at a tertiary
care hospital in north India. Indian Journal of Cancer,
47(1), 16-22.
[21] Saxena, S., Rekhi, B., Bansal, A., Bagga, A., Chintamani
and Murthy, N.S. (2005) Clinico-morphological patterns
of breast cancer including family history in a New Delhi
hospital, India-A cross-sectional study. World Journal of
Surgical Oncology, 3, 67-75.
[22] Nouh, M.A., Ismail, H., Ali El-Din, N.H. and El-Bol-
kainy, M.N. (2004) Lymph node metastasis in breast car-
cinoma: Clinicopathologic correlations in 3747 patients.
Journal of Egyptian National Cancer Institute, 16(1),
50-56.
[23] Gann, P.H., Colilla, S.A., Gapstur, S.M., Winchester, D.J.
and Winchester, D.P. (1999) Factors associated with axil-
lary lymph node metastasis from breast carcinoma de-
scriptive and predictive analyses. Cancer, 86(8), 1511-
1518.
[24] Olivotto, I.A., Jackson, J.S.H., Mates, D., Andersen, S.,
Davidson, W., Bryce, C.J. and Ragaz, J. (1998) Predic-
tion of axillary lymph node involvement of women with
invasive breast carcinoma a multivariate analysis. Cancer,
83(5), 948-955.
[25] Ravdin, P.M., De Laurentiis, M., Vendely, T. and Clark,
G.M. (1994) Prediction of axillary lymph node status in
breast cancer patients by use of prognostic indicators.
Journal of National Cancer Institute, 86(23), 1771-1775.
[26] Chua, B., Ung, O., Taylor, R. and Boyages, J. (2001) Fre-
quency and predictors of axillary lymph node metastases
in invasive breast cancer. Australian and New Zealand
Journal of Surgery, 71(12), 723-728.
[27] Manjer, J., Balldin, G. and Garne, J.P. (2004) Tumour
location and axillary lymph node involvement in breast
cancer: A series of 3472 cases from Sweden. European
Journal of Surgical Oncology, 30(6), 610-617.
[28] Manjer, J., Balldin, G., Zackrisson, S. and Garne, J.P.
(2005) Parity in relation to risk of axillary lymph node
involvement in women with breast cancer. European
Surgical Research, 37(3), 179-184.
[29] Olivotto, I.A., Jackson, J.S.H., Mates, D., Andersen, S.,
Davidson, W., Bryce, C.J. and Ragaz, J. (1998) Predic-
tion of axillary lymph node involvement of women with
invasive breast carcinoma a multivariate analysis. Cancer,
83(5), 948-955.
[30] Ravdin, P.M., De Laurentiis, M., Vendely, T. and Clark,
G.M. (1994) Prediction of axillary lymph node status in
breast cancer patients by use of prognostic indicators.
Journal of National Cancer Institute, 86(23), 1771-1775.
[31] Chua, B., Ung, O., Taylor, R. and Boyages, J. (2001)
Frequency and predictors of axillary lymph node metas-
tases in invasive breast cancer. Australian and New Zea-
land Journal of Surgery, 71(12), 723-728.
[32] Cetintas, S.K., Kurt, M., Ozkan, L., Engin, K., Gokgoz, S.
and Tasdelen, I. (2006) Factors influencing axillary node
metastasis in breast cancer. Tumori, 92(5), 416-422.
[33] Fisher, B., Bauer, M., Wickerham, D.L., Redmond,
C.L.K. and Fisher, E.R. (1983) Relation of number of
positive axillary nodes to the prognosis of patients with
primary breast cancer. Cancer, 52(9), 1551-1557.
[34] Harden, S.P., Neal, A.J., Al-Nasiri, N., Ashley, S. and
Quercidella, R.G. (2001) Predicting axillary lymph node
metastases in patients with T1 infiltrating ductal carci-
noma of the breast. The Breast, 10(2), 155-159.
[35] Guern, A.S. and Vinh-Hung, V. (2008) Statistical distri-
bution of involved axillary lymph nodes in breast cancer.
Bull Cancer, 95(4), 449-455.
[36] Kendal, W.S. (2005) Statistical kinematics of axillary
nodal metastases in breast carcinoma. Clinical & Expe-
rimental Metastasis, 22(2), 177-183.
[37] Cameron, A.C. and Trivedi, P.K. (1998) Regression
Analysis of Count Data. Econometric Society Mono-
graph, Cambridge University Press, New York.
[38] Rose, C.E., Martin, S.W., Wannemuehler, K.A. and
Plikaytis, B.D. (2006) On the use of zero-inflated and
hurdle models for modeling vaccine adverse event count
data. Journal of Biopharmaceutical Statistics, 16(4),
463-481.
[39] Rampaul, R.S., Miremadi, A., Pinder, S.E., Lee, A. and
Ellis, I.O. (2001) Pathological validation and significance
of micrometastasis in sentinel nodes in primary breast
cancer. Breast Cancer Research, 3(2), 113-116.
[40] Schaapveld, M., Otter, R., de Vries, E.G., Fidler, V.,
Grond, J.A., van der Graaf, W.T., de Vogel, P.L. and Will-
emse, P.H. (2004) Variability in axillary lymph node dis-
section for breast cancer. Journal of Surgical Oncology,
87(1), 4-12.
[41] Martin, T.G., Wintle, B.A., Rhodes, J.R., Kuhnert, P.M.,
Field, S.A., Low-Choy, S.J., Tyre, A.J. and Possingham,
H.P. (2005) Zero tolerance ecology: Improving ecologi-
cal inference by modeling the source of zero observa-
tions. Ecology Letters, 8(11), 1235-1246.
[42] Zorn, C.J.W. (1996) Evaluating zero-inflated and hurdle
Poisson specifications. Midwest Political Science Assoc-
iation, San Diego.
[43] Boucher, J.P., Denuit, M. and Guillen, M. (2007) Risk
classification for claim counts: A comparative analysis of
various zero inflated mixed Poisson and hurdle models.
North American Actuarial Journal, 11(4), 110-131.
[44] Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L.
and Kirchner, U. (1999) The zero inflated Poisson model
and the decayed, missing and filled teeth index in dental
epidemiology. Journal of the Royal Statistical Society
(Series A), 162(2), 195-209.
[45] Cheung, Y.B. (2002) Zero-inflated models for regression
analysis of count data: A study of growth and develop-
ment. Statistics in Medicine, 21(10), 1461-1469.