This study fits binary logistic models of home ownership among civil servants in Wukari, Nigeria. The data are primary data collected by questionnaire. The multicollinear data, as well as the data reduced by principal component analysis and by stepwise regression to determine the factors that chiefly account for home ownership, were examined. Four components were selected out of six, namely grade level of respondent, cadre of institution of service of respondent, family size of respondent and age of respondent. The four components selected accounted for 87.78 percent of the variation, and four variables were selected from them. The logit model for home ownership status is obtained from the selected variables. The adequacy of the model was tested using the count R^{2}, which indicates how useful the explanatory variables are in predicting the response variable and can be regarded as a measure of effect size. In testing the significance of each factor, only age of respondent is significant in determining variability in the home ownership model.
The logistic model is a probability model generated from a process that is characterized by a qualitative response variable, which could be binary (dichotomous), ordinal or nominal Gujarati [
The binary response variable can also be modeled using the linear probability approach such that, given

$$Y_i = \alpha_1 + \alpha_2 X_i + \mu_i \tag{1}$$

then

$$E(Y_i = 1 \mid X_i) = \alpha_1 + \alpha_2 X_i = \pi_i \Rightarrow P(Y = 1) = \pi \tag{2}$$

so that $\pi_i$ is the probability of possessing the desired attribute and $\mu_i = Y_i - \alpha_1 - \alpha_2 X_i$, with the restrictions that: the error term is not normally distributed, although it can be assumed to be normally distributed in large samples, and normality is not a necessity if interest is in point estimation; $E(\mu_i) = 0$ and $\operatorname{Cov}(\mu_i, \mu_j) = 0 \;\forall\; i \neq j$, but $\operatorname{var}(\mu_i) = \sigma_i^2 = \pi_i(1 - \pi_i)$, which is heteroscedastic because it arises from a Bernoulli process; and $0 \le E(Y_i = 1 \mid X_i) \le 1$ is not usually sustained. In addition, the coefficient of determination, $R^2$, is generally low, making it a useless tool for testing goodness of fit.
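The difficulty of keeping the fitted probability between 0 and 1 can be seen in a minimal sketch that fits the linear probability model (1) by ordinary least squares. The sample below is made up for illustration, and the helper name `ols_simple` is an assumption of this sketch, not part of the study.

```python
def ols_simple(x, y):
    """Least-squares estimates (a1, a2) for the line y = a1 + a2 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    a2 = sxy / sxx
    a1 = y_bar - a2 * x_bar
    return a1, a2


# Hypothetical sample: x is age in years, y = 1 if the respondent owns a home.
x = [25, 30, 35, 40, 45, 50, 55, 60]
y = [0, 0, 0, 0, 1, 1, 1, 1]

a1, a2 = ols_simple(x, y)
fitted = [a1 + a2 * xi for xi in x]

# Fitted "probabilities" that fall outside the unit interval.
out_of_range = [(xi, round(p, 3)) for xi, p in zip(x, fitted) if p < 0 or p > 1]
print(out_of_range)
```

In this made-up sample the youngest and the oldest respondents receive fitted values below 0 and above 1 respectively, which cannot be read as probabilities.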
Weighted least squares has been advanced as a remedy for the heteroscedasticity of the model, where the weight is calculated as:

$$\pi_i(1 - \pi_i) = W_i \tag{3}$$

which is applied by dividing the model through by $\sqrt{W_i}$:

$$\frac{Y_i}{\sqrt{W_i}} = \frac{\alpha_1}{\sqrt{W_i}} + \alpha_2 \frac{X_i}{\sqrt{W_i}} + \frac{\mu_i}{\sqrt{W_i}} \tag{4}$$

The weights, however, depend on the unknown $\pi_i$, thereby creating a problem for interpolation and extrapolation since $E(Y_i \mid X_i)$ may be unknown for a new outcome.
Given that $Y$ is the realization from $N$ individual outcomes that are independently distributed with $P(Y_i = 1) = \pi_i$ and $P(Y_i = 0) = 1 - \pi_i$, for $i = 1, 2, 3, \cdots, N$, which is a Bernoulli distribution with the probability mass function

$$f(y_i \mid \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i} = \pi_i^{y_i}(1 - \pi_i)(1 - \pi_i)^{-y_i} \tag{5}$$

$$= (1 - \pi_i)\left(\frac{\pi_i}{1 - \pi_i}\right)^{y_i} = (1 - \pi_i)\exp\left(y_i \ln\left(\frac{\pi_i}{1 - \pi_i}\right)\right) \tag{6}$$

where $\frac{\pi_i}{1 - \pi_i}$ is the odds ratio, which indicates the odds in favour of the response variable possessing the required attribute ($Y_i = 1$), and the natural parameter is

$$Q(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) \tag{7}$$

$Q(\pi_i)$ is called the logit link as in Gujarati [

$$l_i = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha_1 + \alpha_2 X_i \tag{8}$$
(8) is the logit model.
$X_i$ and the α's have a linear relationship with $l_i$ in (8) such that for data on individual levels:

if $\pi_i = 0$, $l_i = \ln\left(\frac{0}{1}\right) = \ln 0$, which is undefined; if $\pi_i = 1$, $l_i = \ln\left(\frac{1}{0}\right)$, which is also undefined. As a remedy, we introduce a correction logit defined by $\ln\left(\frac{\pi_i + \frac{1}{2}}{1 - \pi_i + \frac{1}{2}}\right)$ as in Rao and Toutenburg [

$\mu_i \sim N\left(0, \frac{1}{N_j \pi_j (1 - \pi_j)}\right)$ and $\hat{\sigma}^2 = \frac{1}{N_j \pi_j (1 - \pi_j)}$ in the account of Cox [
If it is now given that $P(Y = 1)$ is a function of $X = \{x_1, x_2, \cdots, x_p\}$ so that

$$P(Y = 1) = \pi(x) \quad \text{and} \quad P(Y = 0) = 1 - \pi(x) \tag{9}$$

and

$$E(y) = 1 \cdot \pi(x) + 0 \cdot (1 - \pi(x)) = \pi(x), \qquad E(y^2) = 1^2 \cdot \pi(x) + 0^2 \cdot (1 - \pi(x)) = \pi(x) \tag{10}$$

such that

$$\operatorname{var}(y) = E(y^2) - (E(y))^2 = \pi(x) - (\pi(x))^2 = \pi(x)(1 - \pi(x)) \tag{11}$$
Then $E(y_i) = P(y = 1) = \pi(x) = \alpha_1 + \alpha_2 X$, where $0 \le P(Y = 1) \le 1$ while $\alpha_1 + \alpha_2 X$ can take values between $-\infty$ and $+\infty$. Hence, the model $\pi(x) = \alpha_1 + \alpha_2 X$ can be valid only at specific values of $x$ within a given range, and $\operatorname{var}(y) = \pi(x)(1 - \pi(x))$ is a function of $x$ and hence heteroscedastic, so that the ordinary least squares (OLS) estimate is not optimal, as remarked by Gujarati [

$$l(\alpha_1, \alpha_2) = \prod_{i=1}^{N} \pi_i^{y_i}(1 - \pi_i)^{n_i - y_i} \tag{12}$$

$$= \prod_{i=1}^{N} \frac{\exp\{y_i(\alpha_1 + \alpha_2 x_i)\}}{\left[1 + \exp(\alpha_1 + \alpha_2 x_i)\right]^{n_i}} \tag{13}$$
The MLE can generally be obtained using iterative algorithms such as the Newton-Raphson (NR) method or iteratively re-weighted least squares (IRWLS), which have been implemented in some software packages listed above.
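As an illustration of how such an iterative algorithm proceeds, the sketch below applies Newton-Raphson updates to the two-parameter logit model with a single regressor. The data and the function name `fit_logit_newton` are hypothetical, and safeguards found in statistical software (step halving, checks for complete separation) are deliberately omitted.

```python
import math


def logistic(q):
    """Logistic function mapping a logit q to a probability."""
    return 1.0 / (1.0 + math.exp(-q))


def fit_logit_newton(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson estimates (a1, a2) for logit(pi_i) = a1 + a2 * x_i."""
    a1, a2 = 0.0, 0.0
    for _ in range(max_iter):
        pi = [logistic(a1 + a2 * xi) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        g1 = sum(yi - p for yi, p in zip(y, pi))
        g2 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
        # Information matrix, built from the weights pi * (1 - pi).
        w = [p * (1.0 - p) for p in pi]
        h11 = sum(w)
        h12 = sum(wi * xi for wi, xi in zip(w, x))
        h22 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h11 * h22 - h12 * h12
        # Newton step: solve the 2x2 system (information) * step = score.
        d1 = (h22 * g1 - h12 * g2) / det
        d2 = (h11 * g2 - h12 * g1) / det
        a1 += d1
        a2 += d2
        if abs(d1) + abs(d2) < tol:
            break
    return a1, a2


# Hypothetical, non-separated sample.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

a1, a2 = fit_logit_newton(x, y)
pi_hat = [logistic(a1 + a2 * xi) for xi in x]
print(round(a1, 4), round(a2, 4))
```

At convergence the score is numerically zero, so the fitted probabilities sum to the observed number of ones.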
The effect of $x$ in the logit model in (8) above is monotone rather than nonlinear, hence the need for a logistic regression, which ensures a monotone outline (S-curve) for the probability $\pi(x)$ so that $0 \le \pi(x) \le 1$. The logistic regression is therefore proffered for the logit model such that

$$\pi(x) = \frac{\exp(\alpha_1 + \alpha_2 X)}{1 + \exp(\alpha_1 + \alpha_2 X)} \tag{14}$$
The logit link made the logistic regression to be a generalized linear model Rodriguez [
$$Q(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right) = \alpha_1 + \alpha_2 X = \text{log odds ratio} \tag{15}$$

we then have

$$\frac{\pi}{1 - \pi} = \exp(\alpha_1 + \alpha_2 X) = e^{Q(\pi)} \tag{16}$$

and

$$P(Y_i = 1 \mid X) = \pi(x) = \frac{1}{1 + e^{-Q(\pi)}} = \frac{e^{Q(\pi)}}{1 + e^{Q(\pi)}} \tag{17}$$

Equation (17) is the cumulative logistic function, which is required to determine the probability of obtaining the effect of interest, such as $P(Y = 1)$ or $\pi(x)$, given the effects of some independent variables, say $X_1, X_2, \cdots, X_p$.

So that from (17)

$$\ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = l_i = \alpha_1 + \alpha_2 X_2 + \alpha_3 X_3 + \cdots + \alpha_p X_p \tag{18}$$
where $l_i$ is linear in $X$ as well as linear in the parameters.
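The relationship between the logit and the probability can be checked numerically. The following sketch is illustrative only: the coefficient values and ages are hypothetical and are not estimates from this study.

```python
import math


def logit(p):
    """Log odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(q):
    """Inverse of the logit: maps any real q to a probability."""
    return 1.0 / (1.0 + math.exp(-q))


# Hypothetical coefficients for a single regressor (age in years).
alpha1, alpha2 = -4.0, 0.1
ages = [20, 30, 40, 50, 60]

# The logit is linear in age, while the probability follows an S-curve.
logits = [alpha1 + alpha2 * age for age in ages]
probs = [logistic(q) for q in logits]

# Applying the logit to the probabilities recovers the linear predictor.
round_trip = [logit(p) for p in probs]

for age, q, p in zip(ages, logits, probs):
    print(age, round(q, 2), round(p, 4))
```

Equal steps in age change the logit by the same amount each time, but the corresponding changes in probability are unequal, and every probability stays strictly between 0 and 1.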
It is pertinent to point out that: the logit, $l_i$, may be linear in $X$ but the probabilities $\pi(x)$ are not; the logits are not bounded, since $l_i$ goes from $-\infty$ to $+\infty$ as $\pi(x)$ goes from 0 to 1; a negative $l_i$ implies that the odds in favour of $Y = 1$ decrease as $X$ increases, if and only if a single $X$ is considered; and the effects of more than one explanatory variable can be studied, as outlined by Gujarati [
The logistic model is also a good classification model and can serve as an alternative to Fisher's linear discriminant analysis; however, the logistic model does not require the multivariate normality assumption of discriminant analysis, as asserted by Rodriguez [
In the presence of more than one explanatory variable, multicollinearity may arise. Home ownership models exhibit some form of multicollinearity among the explanatory variables Gujarati [
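$$\hat{Y}_i = \hat{\alpha}_0 + \hat{\alpha}_1 X_{1i} + \hat{\alpha}_2 X_{2i} \tag{19}$$

the normal equations are:

$$\sum Y_i = n\hat{\alpha}_0 + \hat{\alpha}_1 \sum X_{1i} + \hat{\alpha}_2 \sum X_{2i} \tag{20}$$

$$\sum Y_i X_{1i} = \hat{\alpha}_0 \sum X_{1i} + \hat{\alpha}_1 \sum X_{1i}^2 + \hat{\alpha}_2 \sum X_{1i} X_{2i} \tag{21}$$

$$\sum Y_i X_{2i} = \hat{\alpha}_0 \sum X_{2i} + \hat{\alpha}_1 \sum X_{2i} X_{1i} + \hat{\alpha}_2 \sum X_{2i}^2 \tag{22}$$

Then from (20)

$$\hat{\alpha}_0 = \bar{Y} - \hat{\alpha}_1 \bar{X}_1 - \hat{\alpha}_2 \bar{X}_2 \tag{23}$$

Also, solving (20), (21) and (22) simultaneously, we have

$$\hat{\alpha}_1 = \frac{\left(\sum y_i x_{1i}\right)\left(\sum x_{2i}^2\right) - \left(\sum y_i x_{2i}\right)\left(\sum x_{1i} x_{2i}\right)}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2} \tag{24}$$

$$\hat{\alpha}_2 = \frac{\left(\sum y_i x_{2i}\right)\left(\sum x_{1i}^2\right) - \left(\sum y_i x_{1i}\right)\left(\sum x_{1i} x_{2i}\right)}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2} \tag{25}$$

where $y$ and $x$ are in deviation form such that $y = (Y - \bar{Y})$ and $x = (X - \bar{X})$.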
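In the presence of perfect multicollinearity, $x_{2i} = \lambda x_{1i}$ for $\lambda$, a non-zero constant. Substituting for $x_{2i}$ in (24), we have

$$\hat{\alpha}_1 = \frac{\left(\sum y_i x_{1i}\right)\left(\lambda^2 \sum x_{1i}^2\right) - \left(\lambda \sum y_i x_{1i}\right)\left(\lambda \sum x_{1i}^2\right)}{\left(\sum x_{1i}^2\right)\left(\lambda^2 \sum x_{1i}^2\right) - \lambda^2\left(\sum x_{1i}^2\right)^2} = \frac{0}{0} = \hat{\alpha}_2 \ (\text{indeterminate}) \tag{26}$$

However, for non-perfect but high multicollinearity such as $x_{2i} = \lambda x_{1i} + v_i$, with $\lambda \neq 0$ and $\sum x_{1i} v_i = 0$,

then

$$\hat{\alpha}_1 = \frac{\left(\sum y_i x_{1i}\right)\left(\lambda^2 \sum x_{1i}^2 + \sum v_i^2\right) - \left(\lambda \sum y_i x_{1i} + \sum y_i v_i\right)\left(\lambda \sum x_{1i}^2\right)}{\left(\sum x_{1i}^2\right)\left(\lambda^2 \sum x_{1i}^2 + \sum v_i^2\right) - \left(\lambda \sum x_{1i}^2\right)^2} \tag{27}$$

where $\sum x_{1i} v_i = 0$. Here $\hat{\alpha}_1$ is finite, but as $v_i \to 0$, $\hat{\alpha}_1$ becomes undefined as in (26).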
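The role of $v_i$ can be made explicit by expanding the denominator of (27). This short step is added for illustration and uses only the quantities already defined above:

$$\left(\sum x_{1i}^2\right)\left(\lambda^2 \sum x_{1i}^2 + \sum v_i^2\right) - \left(\lambda \sum x_{1i}^2\right)^2 = \sum x_{1i}^2 \sum v_i^2$$

The terms in $\lambda^2\left(\sum x_{1i}^2\right)^2$ cancel, so the denominator is proportional to $\sum v_i^2$. As the $v_i$ shrink towards zero the denominator, and likewise the numerator, tends to zero, which reproduces the indeterminate form in (26).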
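Another consequence of severe multicollinearity is that the variances of the ordinary least squares (OLS) estimates become infinitely large. From the normal equations we can obtain

$$\operatorname{Var}(\hat{\alpha}_0) = \left[\frac{1}{n} + \frac{\bar{X}_1^2 \sum x_{2i}^2 + \bar{X}_2^2 \sum x_{1i}^2 - 2\bar{X}_1 \bar{X}_2 \sum x_{1i} x_{2i}}{\sum x_{1i}^2 \sum x_{2i}^2 - \left(\sum x_{1i} x_{2i}\right)^2}\right]\hat{\sigma}^2 \tag{28}$$

$$\operatorname{Var}(\hat{\alpha}_1) = \frac{\hat{\sigma}^2}{\sum x_{1i}^2\left(1 - r_{12}^2\right)} \tag{29}$$

$$\operatorname{Var}(\hat{\alpha}_2) = \frac{\hat{\sigma}^2}{\sum x_{2i}^2\left(1 - r_{12}^2\right)} \tag{30}$$

where $r_{12}^2 = \frac{\left(\sum x_{1i} x_{2i}\right)^2}{\sum x_{1i}^2 \sum x_{2i}^2}$ and $\hat{\sigma}^2 = \frac{\sum \mu_i^2}{n - k}$, with $k$ the number of parameters in the model. $\frac{1}{1 - r_{12}^2}$ is the variance inflation factor (VIF) and $\operatorname{Var}(\hat{\alpha}_j) = \frac{\hat{\sigma}^2}{\sum x_j^2} VIF_j$. If $r_{12}^2 \to 1$ then $\operatorname{Var}(\hat{\alpha}_j) \to \infty$.

Also, $\frac{1}{1 - r_{12}^2} = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination from regressing $X_j$ on the other explanatory variable(s).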
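A small numerical sketch can show how quickly the variance inflation factor grows as two regressors approach exact collinearity. The data are invented, and `vif_two_regressors` is a hypothetical helper written for the two-regressor case only.

```python
def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - r12^2) for two regressors, from deviation-form sums."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    r2 = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r2)


# Hypothetical data: x2 = 2 * x1 + scale * v, where v sums to zero and is
# orthogonal to x1 in deviation form.
x1 = [1, 2, 3, 4, 5]
v = [1, -2, 2, -2, 1]

results = {}
for scale in (1.0, 0.5, 0.1):
    x2 = [2 * a + scale * b for a, b in zip(x1, v)]
    results[scale] = vif_two_regressors(x1, x2)
    print(scale, round(results[scale], 2))
```

Shrinking the independent part of the second regressor moves $r_{12}^2$ towards 1, and the VIF (and with it the variance of each slope estimate) rises sharply.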
Application of the binary logistic model to home ownership in Wukari
Wukari is a town in Wukari Local Government Area of Taraba State in Nigeria.
Based on the 2006 national census figures, Wukari has a population of 234,546 and the town is divided into three wards: Avyi, Puje and Hospital [
Conflicts have an adverse effect on economic growth through the destruction of human and physical capital, shifts in public spending and private investment, as well as the disruption of economic activities and social life, as asserted by Okeke et al. [
Housing is not a luxury, as asserted by Geoffrey [
According to Hood [
Also, integrated households are more likely to own a house than separated or marginalized ones. Hence, the probable determinants of home ownership may include employment status, income, education, marital status, family composition, access to home financing and discrimination Lauridsen and Skak [
It is pertinent to point out that these expositions did not take into cognizance the influence of the risk factor, notably, conflict, in home ownership decision. However, this study will take that into perspective in explaining the result of the logistic model.
Data Collection
The data used for the study are primary data obtained from questionnaires administered to three hundred (300) respondents (civil servants) working in the various cadres of government institutions in Wukari, namely local government, state and federal.
In the questionnaire, a total of twenty-three questions were asked, from which the responses were extracted for the purpose of this study. The questions were simple and clear, to avoid ambiguity, and they bordered on: monthly income of respondent (X_{i1}), grade level of respondent (X_{i2}), years in service (X_{i3}), cadre of institution of service of respondent (X_{i4}) (i.e. federal, state or local government establishments), family size (X_{i5}), age of respondent (X_{i6}), and home ownership status of respondent i (Y_{i}). It is pertinent to point out that state and local government workers are recruited from the locality, while the federal workers, who earn higher salaries, are drawn from across the federation. We also have to bear in mind that monthly income is more all-encompassing than monthly salary, which is determined by the grade level.
A pilot survey was conducted to determine the content validity of the questionnaires, to enable adjustment of the questions for the research, and to fine-tune the content so that the questions were clear, precise and unambiguous enough for the respondents to give meaningful responses, in line with Okafor [
A total of 300 questionnaires were issued to civil servants in federal, state and local government agencies. We were able to retrieve 250 questionnaires, out of which 200 were valid and put into use. The retrieved questionnaires were used to extract the data used for the analysis.
Data Analysis
The extracted data were arranged for analysis. The qualitative and dichotomous response variable (Y) was coded as a dummy variable, taking the value 1 if the respondent owns a house and 0 if the respondent does not. Some of the explanatory variables, i.e. the factors of home ownership, were quantitative while others were qualitative and were assigned appropriate dummy variables.
The data were analyzed using the binary logistic regression model. The data were also reduced using principal component analysis as an input tool, as in Abdi and Williams [
Adequacy of the models
The maximum likelihood estimates are asymptotically normal under general conditions, and the significance of the effects of $X_i$ on $\pi_i$ is tantamount to the significance of $\alpha_2$ (the regression coefficient) or the α's (the partial regression coefficients), as the case may be Gujarati [
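We use the simple count R^{2} to determine the adequacy of the logistic model in the presence of violation of the OLS assumptions, under which the OLS estimates are still unbiased but inefficient. The count R^{2} is the number of correct predictions divided by the total number of observations, where an observation is predicted as 1 if its estimated probability exceeds 0.5 and as 0 otherwise.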
Other tests for the adequacy of the logistic (logit) model are the McFadden R^{2}, pseudo R^{2}, Cox and Snell R^{2} and Nagelkerke R^{2}, etc. The R^{2} statistics indicate how useful the explanatory variables are in predicting the response variable and can be referred to as measures of effect size.
However, the count R^{2} is a simple and more reliable tool for showing the predictive power of the model Gujarati [
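For concreteness, the count R^{2} can be computed as the share of observations whose predicted class matches the observed outcome. The sketch below assumes the conventional 0.5 cut-off; the outcomes, the fitted probabilities and the function name `count_r2` are made up for illustration and are not the study's data.

```python
def count_r2(y_true, prob, threshold=0.5):
    """Share of observations classified correctly at the given cut-off."""
    predicted = [1 if p > threshold else 0 for p in prob]
    correct = sum(1 for yt, yp in zip(y_true, predicted) if yt == yp)
    return correct / len(y_true)


# Hypothetical observed outcomes and fitted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3, 0.55, 0.45]

print(count_r2(y_true, prob))  # 8 of the 10 cases are classified correctly: 0.8
```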
Using principal component analysis (PCA) by the correlation matrix approach, we selected the variables X_{2} (grade level of respondent), X_{4} (cadre of institution of service of respondent), X_{5} (family size) and X_{6} (age of respondent), while the stepwise regression approach selected the variables X_{1} (monthly income of respondent) and X_{6} (age of respondent).
The results of the analysis using the multicollinear data, PCA and stepwise regression, respectively, yield the logit models (31), (32) and (33) below:
The odds in favour of owning a home in Wukari by a civil servant in the presence of the intervening variables,
The logistic models for determining the probabilities
The Wald test for the significance of the model coefficients showed that in (31) and (33), X_{1} (monthly income of respondent) and X_{6} (age of respondent) are significant, while in (32), though X_{2}, X_{4}, X_{5} and X_{6} account for 87.78% of the variation in Y, only X_{6} (age of respondent) is significant as shown in
The binary logistic model of PCA is more adequate than the ones involving the multicollinear data and stepwise regression in their predictive power with a count
An interesting feature of the three models is that income (X_{1}) and age (X_{6}) of respondents have a positive effect on home ownership while cadre of institution of service (X_{4}) of respondent has a negative effect. The negative effect of cadre of institution of service (i.e. federal, state or local government establishments) could be attributed to the risk factor associated with building in a conflict area, of which Wukari is one of the most volatile in Taraba State of Nigeria. The preferred model is the binary logistic model of PCA.

Variables | PCA: standard error (P-value) | Stepwise regression: standard error (P-value) | All variables: standard error (P-value) |
---|---|---|---|
X_{1} | - | 0.00000402 (0.0380) | 0.0000053 (0.018) |
X_{2} | 0.04 (0.56) | - | 0.07 (0.24) |
X_{3} | - | - | 0.04 (0.11) |
X_{4} | 0.14 (0.35) | - | 0.16 (0.38) |
X_{5} | 0.06 (0.80) | - | 0.06 (0.72) |
X_{6} | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.01) |
Count R^{2} | 0.70 | 0.63 | 0.53 |
The authors declare no conflicts of interest regarding the publication of this paper.
Okeke, J.U., Okeke, E.N. and Dakhin, Y.V. (2020) Binary Logistic Models of Home Ownership in Wukari Nigeria. Open Journal of Statistics, 10, 64-73. https://doi.org/10.4236/ojs.2020.101005