Survival Analysis in Modeling the Birth Interval of the First Child in Indonesia

First birth interval is one of the examples of survival data. One of the characteristics of survival data is its observation period that is fully unobservable or censored. Analyzing the censored data using ordinary methods will lead to bias, so that reducing such bias required a certain method called survival analysis. There are two methods used in survival analysis that are parametric and non-parametric method. The objective of this paper is to determine the appropriate method for modeling the birth of the first child. The exponential model with the inclusion of covariates is used as parametric method, considering that the newly married couples tend to have desire for having baby as soon as possible and such desire will be weakened by increasing age of marriage. The data that will be analyzed were taken from the Indonesia Demographic and Health Survey (IDHS) 2012. The result of data analysis shows that the birth of the first child data is not exponentially distributed thus the Cox proportional hazard method is used. Because of the suspicion that disproportional covariate exists, then the proportional hazard test is conducted to show that the covariate of age is not proportional, the generalized Cox proportional method is used, namely Cox extended that allows the inclusion of disproportional covariates. The result of analysis using Cox extended model indicates that the factors affecting the birth of the first child in Indonesia are the area of residence, educational history and its age.


Introduction
In demography, there are three things which take effect, namely mortality, migration, and fertility.The birth interval of the first child can be used as one of indicators of fertility.The birth interval of the first child is defined as difference between the age of marriage and the age of birth of the first child.In fact, the length of birth interval of the first child of every married woman is not same.According to existing research, the birth interval of the first child is determined by many kinds of social and culture even physiology factors.According to [1], there are several factors that affect the birth interval of the first child, which are the area of residence, educational level, age, the knowledge of contraception and employment status.Most of married women have given birth to their first child a couple months right after the marriage, so that the obtained data are the complete one, but the others who do not have children yet are classified as the censored data.Thus the birth interval of the first child is an example of the survival data.The survival data are the data that indicate the period of time since the initial observation to a happened event.The characteristic of the survival data is its survival time, which usually could not be fully observed (censored) [2].If all the expected events have happened, and are able to be fully observed, some analysis methods can be done; unfortunately, the survival data are censored [3].Analyzing the survival data using the ordinary method would be inappropriate because of causing bias [4].A certain method to analyze it was needed for reducing such bias.This certain method is called survival analysis.

Definitions Associated with Survival Analysis
Definition 1: Survival time measures the period of time from the initial event to happened event such as failure, death, respond, symptoms, and etc. [5].
Definition 2: Survival function is a function indicates the probability of an individual who survives until or more than t time (experiencing the event after t time) [6].Consider random variable T, then the survival function define as, ( ) ( ) Consider f as the probability density function, then the survival function also defines as complement of cumulative function F as, ( Definition 3: Hazard function is a function indicates the probability of an individual of having risk or experiencing event such as failure or die at t time in the condition that this individual survived to t time, the function given by: ( ) ( ) From those definitions above, the relation between the survival function and hazard function are obtained.
Using the definition of the conditional probability, obtained:

Model
In this research, there are two survival methods that will be used, i.e. exponential parametric and non-parametric method.

Parametric Method
Survival time can be analyzed by using the accelerated failure time (AFT) model.In the survival time, this model assumes that the logarithm relation of survival time T and its covariates are linear and can be written as , exp exp

Kaplan-Meier Model
In the Kaplan-Meier method, assuming the distribution of the data is discrete.According to [10], Kaplan-Meier is a method used to compare the survival time of two covariate groups.The advantage of this method is that the non-parametric method does not require the knowledge of a particular distribution [11].This method is appropriate because the data used are the individual data, but still appropriate for small, medium and large data sizes.Suppose the time of the birth of the first child denotes by r and the number of women who are married denotes by n, where r ≤ n.The probability of the birth of the first child in every interval j estimated by , and its survival probability estimated by ˆ. j j j j n d p n The estimator for survival function of Kaplan-Meier method is given by, ( )

Cox Proportional Hazard and Non-Proportional Hazard Model
Cox proportional hazard is a model usually used as multivariate approach to analyze the data [12].The characteristic of Cox proportional hazard model indicates that every different individual has proportional hazard function, that is, ( ) ( )  , the ratio of the hazard function of two individuals with the inclusion of cova- riates ( ) ( ) are constant.This means that the ratio of failure risk of two individuals is the same and does not depend on how long they survived.[13] explains that the general form of Cox proportional hazard model is: where x denotes covariate, but he did not make any assumptions about the form of ( ) 0 h t itself, which is called the baseline of hazard function, because it is the value of hazard function when 0. x = Sometimes the time-dependent covariate was found, so it is not met with the proportional assumption, then the form above was developed into Cox extended model: , exp For checking the proportional assumption of covariate, the Cox extended model can be written as: where ( ) j g t denotes function of time and it is important to determine the proper form of ( ). ( ) If the test result j δ is significant then the Cox extended model is better than the Cox proportional hazard model, thus the ratio of hazard function is a function of time. 3) ( ) j g t heavyside function.When this function is used then obtained the constant of hazard ratio for dif- ferent time intervals.

Parameter Estimation
In estimating parameter 1 2 , , , ,  Cox using the maximum likelihood estimation method (the maximum likelihood estimator) by only considering the individual probability who are experiencing the event called partial likelihood [14].Estimation j β using partial likelihood means maximizing the partial likelihood function.Partial likelihood function is the joint probability survival function of uncensored data formed by the function of unknown value of parameter.[7] stated that parameter estimation β can be proved by taking survival individ- ual cases such as the death event.Suppose there are n individuals with r individuals who are having death then there exist ( ) n r − individuals censored.Assumed that there is only an individual who died at a certain time of death (there is no ties).Again suppose that ( ) ( ) ( ) ( ) is the ordered and uncensored survival time.The probability of the i-th individual death at time ( ) in terms of ( ) as the only exact time of death of , , , , r t t t t  and the covariate for died individual at time ( ) denote as: ( ) ( ) Thus the likelihood function of the conditional probability above is , , , , n t t t t  and i δ denotes the indicator of event by valued 0, th individual right censored 1, others.
Then the equation of likelihood function can be written as If equation turned to be logarithm then obtained Parameter estimator β can be obtained by maximizing the log function ( ) L β , so that the solution be ( ) The solution of such equation is hardly solved analytically but more easily solved using numeric method.

Data
The data used in this research taken from the result of Indonesia Demographic and Health Survey in 2002.The samples used are the data of two provinces, that are West Papua and Special Area of Yogyakarta as representation of high and low fertility level area.The data limited to the birth interval of the first child, of first time married woman.

Operational Definition
The dependent variable used in this research is the birth interval of the first child of a first time married woman.Whereas the independent variables that estimated to affect the birth interval of the first child are: 1) Area of residence, grouped in the smallest administrative unit of area, that are urban and countryside area.The area of residence divided into two categories, that are city = 1 and village = 2.
2) Education, schools are the formal schools start from the primary, junior, and high school, including the equated education.Those who never sign into formal education or ever been in primary school but never getting a passing mark are classified as not finished the primary school.The highest education divided into four categories, that are not finished the primary school = 0, finished the primary school = 1, finished the junior high school = 2, and finished the senior high school or higher = 3.
3) Employment status, working is the activity of doing the job with purpose of obtaining or earning income or profit for at least one hour per week continuously and uninterrupted (including as unpaid family workers who helped in the business/economic activity).Employment status are categorized into unemployment = 0 and employment = 1.
4) Knowledge about contraception, divided by two categories, unknown = 0 and known = 1.5) Age of mother, age of mother/first time married woman expressed in years.

Results
The Kolmogorov-Smirnov goodness of fit test can be applied to test whether a sample theoretically follows a population distribution.Based on the calculating results obtained that the value of D = 0.258 and the critical value of table D* for significance level 0.05 is 0.04025.Since the calculating value of D is greater than the critical value of table D then reject H 0 , which means that the data is not exponentially distributed.Furthermore, since the data of birth interval of the first child is not exponentially distributed then the exponential method is not used for analyzing the factors that supposed dominantly affecting the birth interval of the first child.
As an illustration, the birth interval data of the first child will be analyzed using the Kaplan-Meier method.Its respond variable is the time of marriage to the time of having the first child and the area of residence is the variable that affect the survival level (0 = village, 1 = city).
The survival analysis results using the Kaplan-Meier method for independent area of residence of variable can be shown in graphic as follows.
From Figure 1 shown that the survived individual who lived in the villages are different with those who lived in the cities.Next the hypothesis test will be conducted to see whether this difference significantly affect or just coincidence, by the following hypothesis, ( : is the statistic test value then the rejection area is H 0 .The log rank test results based on the area of residence covariate status are presented in the Table 1.
By using the significance level 0.05 of chi-square test with degree of freedom 1 obtained W value significant enough to reject H 0 .So, it can be concluded that there is significantly difference among the survival level of the birth of the first child, who lived in the village and in the city.
If the survival data that will be compared are comes from more than two individual groups, e.g.we are interested in seeing the characteristic difference of the area of residence, education level, age, knowledge about contraception and so forth, then the Kaplan-Meier method will be impractical.This caused of in the Kaplan-Meier method, every two of population groups must be tested separately, so if there are several groups then repeated testing must be conducted.If the respondents have several characteristics, then the Cox proportional hazard method able to explain the influence of such characteristics to the respond variables simultaneously.
Table 2 shows that the explanatory variables significantly affected to the interval birth of the first child are the area of residence, education 1, education 2, education 3, and age.Next the estimation to the proportional assumption of every covariate was done.In this paper, the only method used to check the covariate that does not met the proportional assumption in the birth interval of the first child is Schoenfeld residuals method.The results are presented in Table 2.
The covariate of p-value of individual aged less than 0.05 means there is correlation between such covariate to the time rating until that individual had her first child, so that the age covariate does not met the proportional assumption.Thus the data re-modeled using the Cox extended model.Before the analysis is done, conducted the test previously to see more appropriate form of ( ) j g t .This test uses the AIC criteria.The AIC is a measure for selecting the best regression model, introduced by Hirotugu Akaike in 1973.The AIC method is based on the maximum likelihood estimation [15], with the equation as follows where L indicates the likelihood function, q indicates the sum of parameter β , and α indicates the speci- fied constant.The value of α that often used is between 2 and 6.Based on the AIC method, the best regression model is the model that has the smallest value of AIC [9].The value of AIC formed in ( ) is the smallest among the two other models (Table 3).Thus this is the best model.
Next the test was done to see significantly affected covariate to the respond variable.The results are presented in Table 4.
In Table 4 seen that the explanatory variables that significantly affect to the birth interval of the first child are the area of residence, education 1, education 2, education 3, and age.

Interpretation
The value of hazard ratio allows us to compare among multiple groups in the survival analysis [16].Table 5 shows that the area of residence, education 1, education 2, education 3, and age variables are significantly affect to the individual risk of having first child.From the value of hazard ratio seen that individual who lived in the city had 0.720 times lower risk of having their first child than who lived in village one, and the educational history of individual who finished their primary, junior, senior high school or higher will increase the risk of having their first child respectively 1.708, 2.648, 4.361 times of individual who had lower educational history or not finished their primary school at all.Every additional age of one year will reduce the risk of having first child as 4.5%.
Table 4 shows that the area of residence, education 1, education 2, education 3, and age variables are the significantly affected to the individual risk of having her first child.From the value of hazard ratio seen that individual who lived in the city had 0.719 times lower risk of having their first child than who lived in village one, and the educational history of individual who finished her primary, junior, senior high school or higher will increase the risk of having their first child respectively 1.730, 2.648, 4.235 times of individual who had lower educational history or not finished their primary school at all.Every additional age of one year will reduce the risk of having first child as 3.0166%.

Conclusions
The result of distribution tests by using the Kolmogorov-Smirnov test indicates that the data do not have a particular distribution, so the Cox proportional hazard model of non-parametric method is a more appropriate method for modeling the birth interval data of the first child.However, after testing the assumption of proportional hazard model, it turned out that one of the covariates did not meet the proportional assumption, so that the data re-modeled by using generalized Cox proportional hazard model called Cox extended model.
Although there are no differences among the explanatory variables that significantly affect the using of Cox proportional as Cox extended model, but the interpretation of both models is still different and based on the conducted test.The Cox extended model is the best model in modeling the birth data of the first child in Indonesia.

1 σ.
of σ indicates the scale para- meter and ε indicates the error.[9]states that for inclusion of covariates in the exponential distribution, we use the equation above and choose T is the exponential distribution with hazard function, probability density function, and survival function respectively is the simplest form by resulting the Coxproportinal hazard model.2) vector for individual who died at time ( ) Multiplication in the likelihood func- tion only for uncensored individual.The censored individual excluding in the numerator but in the denominator is the sum of data consist of n observation of survival time 1 2 3

Figure 1 .
Figure 1.Survival function of Kaplan-Meier method for independent area of residence variables.

Table 1 .
The log rank test result to see the difference of the survival level of the birth of the first child based on the area of residence covariate.

Table 2 .
Correlation and p-value of explanatory variables.

Table 3 .
The value of AIC model.

Table 4 .
Parameter estimator, p-value, and the hazard ratio by using the cox extended model.

Table 5 .
Parameter estimator, p-value, and hazard ratio using cox proportional hazard model.