SesIndexCreatoR: An R Package for Socioeconomic Indices Computation and Visualization
1. Introduction
When studying social inequalities, it is generally interesting to take into account the socioeconomic status (SES) of an individual, a neighborhood or a region rather than consider only one socioeconomic variable such as educational level or income. However, socioeconomic status is a complex and multidimensional concept which encompasses many aspects such as employment, income, education, housing and social bonds. All of these aspects can themselves be represented by various variables. To synthesize and consider these different aspects, one solution is to create a SES index.
There are already many SES indices, especially at the neighborhood level [1]-[9]. However, most of them use a small number of variables, combine variables with simple methods (such as Z-scores) and/or select variables only from the literature, which seems inappropriate for the purpose of the Equit’Area Project, a public health program focused on social and environmental health inequalities (http://www.equitarea.org), as detailed elsewhere [10]. Thus, a new statistical procedure to create neighborhood socioeconomic indices was developed. Rather than building an index from a fixed, predetermined set of variables, this procedure selects, from a large data set, the variables that will compose the SES index. It is based on several successive principal component analyses, and the whole procedure is detailed in the aforementioned article. It has already been applied successfully in several analyses of health or environmental inequalities [11] [12].
Because our procedure is slightly more complex to understand and apply than some other SES indices, especially for non-statisticians, we implemented it in an R package [13] named SesIndexCreatoR. The package is freely available on the website of the Equit’Area project (http://www.equitarea.org/documents/packages_1.0-0/) and on CRAN. It contains the basic functions needed to run the procedure (in its entirety or only some of its steps) and to obtain the corresponding SES index. The purpose of this package is to provide tools that are as simple as possible while keeping the various options the procedure offers, such as using different data mining methods, adding illustrative units, or performing only one step of the procedure. Moreover, once the index is created, users can display the results of the different analyses as both text and graphical output, and generate a summary report. The user may also create categories of this index with different methods (hierarchical clustering with or without k-nearest neighbors, quantiles, or intervals).
In this paper we present and illustrate the use of the SesIndexCreatoR package for the Lille agglomeration (a large French metropolitan area). For further examples, we recommend the works by Padilla et al. mentioned above.
2. Material and Methods
2.1. Data
The example data provided in the SesIndexCreatoR package concern one large city in France, Lille (Nord Pas de Calais region, northern France), and some adjacent municipalities. The statistical unit is the sub-municipal French census block group (called IRIS) defined by the National Institute of Statistics and Economic Studies (INSEE). These units have an average of 2000 inhabitants and are constructed to be as homogeneous as possible in terms of socio-demographic characteristics and land use. Census block groups (BGs) are divided into three distinct categories: housing, economic activity and miscellaneous. Housing BGs are the most common; economic activity BGs include at least 1000 employees and at least twice as many employees as residents; and miscellaneous BGs are specific wide, sparsely populated areas (leisure parks, port areas, forests, etc.). As activity and miscellaneous BGs have particular profiles due to the way they are defined, they are treated in the example as illustrative units (meaning that they do not take part in the construction of the index but still receive an index value). For confidentiality and distribution reasons, the real BG identifiers are replaced in the example data set with a simple number from 1 to 234 (234 being the number of BGs in the area).
Socioeconomic data are taken from the 1999 national census (source: INSEE) and provide counts of population, households and residences at the BG scale, covering all the social, economic and demographic aspects. Median income at the BG scale is taken from a second database: the “Revenus fiscaux des ménages” database (source: INSEE-DGI). Using these raw data, 37 variables are defined at the BG scale based on the INSEE definitions. These variables are chosen to be representative of the theoretical concept of SES and in line with the variables most often used in the literature, or that could be considered as linked with the SES concept.
Table 1. Description of 37 socioeconomic variables available for the Lille agglomeration at the census block group scale, by domain. (Unless stated otherwise, variables are proportions expressed in %; a Redundant group “labor force”; b Redundant group “unemployment”; c Not a proportion).
Table 2. Description of the 37 socioeconomic variables available for the Lille agglomeration at the census block group scale, by domain (continued). (Unless stated otherwise, variables are proportions expressed in %).
All variables are related to family structure, household type, immigration status, employment, income, education and housing (more details are available in Table 1 and Table 2). Some of the variables are intentionally redundant and represent the same notion, in order to determine which one best represents this notion (using the algorithm implemented in the proposed package). In our example, there are two such groups: 7 unemployment variables and 3 labor force variables. We also note that there is an unexpectedly high number of missing values for median income; since we wanted to keep this variable in the analysis, we filled missing values with the average value of the adjacent BGs.
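The neighbor-based filling of missing median income values can be illustrated with a minimal sketch. The BG names, income values and adjacency list below are hypothetical and serve only to show the idea; the actual adjacency information used for the Lille data is not part of this example.

```r
# Hypothetical median incomes for five BGs; NA marks a missing value
income <- c(b1 = 21000, b2 = NA, b3 = 18000, b4 = 25000, b5 = NA)

# Hypothetical adjacency list: for each BG, the names of its adjacent BGs
neighbours <- list(
  b1 = c("b2", "b3"),
  b2 = c("b1", "b3", "b4"),
  b3 = c("b1", "b2"),
  b4 = c("b2", "b5"),
  b5 = c("b4")
)

# Replace each missing value by the mean of the observed values of the
# adjacent BGs (always read from the original vector, so that the result
# does not depend on the order in which BGs are processed)
filled <- income
for (bg in names(income)[is.na(income)]) {
  filled[bg] <- mean(income[neighbours[[bg]]], na.rm = TRUE)
}
print(filled)
```

A BG whose neighbors are all missing would still receive NaN under this rule, so such cases would need separate handling.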
2.2. SES Index Creation
The SES index creation procedure is detailed in Lalloué et al. [10] . Basically, it follows three successive steps:
(1) Study of the redundant variables. As already mentioned, several variables represent the same notion and we want to determine which one best represents this notion. Therefore, one variable is selected for each group by applying principal component analysis (PCA) to each of the groups of redundant variables. The selected variable for each group is the one with the largest correlation with the first component of the PCA on the group.
(2) Selection of the variables. A PCA or a multiple factor analysis (MFA) on the remaining variables (i.e., non-redundant variables and variables selected in step (1)) is used to select the variables whose contribution to the first component is larger than the average contribution, i.e., the variables best correlated with the first component. The choice between PCA and MFA depends on whether one wants to give the same weight to each domain in the analysis (MFA) or not (PCA).
(3) Construction of the index. A final PCA is carried out on the variables selected in step (2). Provided that the first component of this PCA can be interpreted as a “SES component”, it is used to calculate the socioeconomic index as the reduced first component.
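The three steps above can be sketched with base R only. This is an illustration on synthetic data, not the package's implementation: the variable names, the simulated latent factor, the use of prcomp, the use of the absolute correlation in step 1, and the reading of "reduced" as "scaled to unit variance" are all assumptions made for this example. Only the PCA variant of step 2 is shown.

```r
set.seed(1)
n <- 200
latent <- rnorm(n)  # hypothetical underlying SES factor

# Synthetic data: three redundant "unemployment" variables, three variables
# linked to the latent factor, and two pure-noise variables
dat <- data.frame(
  unemp_a = latent + rnorm(n, sd = 0.2),
  unemp_b = latent + rnorm(n, sd = 1.0),
  unemp_c = latent + rnorm(n, sd = 2.0),
  income  = -latent + rnorm(n, sd = 0.4),
  no_car  = latent + rnorm(n, sd = 0.5),
  renters = latent + rnorm(n, sd = 0.6),
  noise1  = rnorm(n),
  noise2  = rnorm(n)
)
redundant <- list(unemployment = c("unemp_a", "unemp_b", "unemp_c"))

# Step 1: in each redundant group, keep the variable most correlated
# (in absolute value) with the first component of a PCA on that group
pick_one <- function(vars) {
  pc1 <- prcomp(dat[, vars], scale. = TRUE)$x[, 1]
  r <- abs(cor(dat[, vars], pc1)[, 1])
  names(r)[which.max(r)]
}
step1 <- unname(vapply(redundant, pick_one, character(1)))

# Step 2: PCA on the non-redundant variables plus the step 1 selection;
# keep variables whose contribution to component 1 exceeds the average
candidates <- c(setdiff(names(dat), unlist(redundant)), step1)
pca2 <- prcomp(dat[, candidates], scale. = TRUE)
contrib <- pca2$rotation[, 1]^2   # squared loadings sum to 1
step2 <- names(contrib)[contrib > mean(contrib)]

# Step 3: final PCA on the selected variables; the index is the first
# component rescaled to mean 0 and standard deviation 1
pca3 <- prcomp(dat[, step2, drop = FALSE], scale. = TRUE)
ses_index <- as.numeric(scale(pca3$x[, 1]))

print(step1)
print(step2)
summary(ses_index)
```

The sign of a principal component is arbitrary, so whether high index values mean deprivation or affluence has to be read from the correlations of the variables with the first component.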
3. The SesIndexCreatoR Package
The SesIndexCreatoR package depends on the FactoMineR [14] [15] and class [16] packages. In particular, most of the data analysis and visualization functions used in this package, such as principal component analysis or hierarchical clustering, come from FactoMineR. We thus refer the user to the FactoMineR package and its manual for details on the outputs of the PCA and HC functions. The sources and binaries of SesIndexCreatoR are available on the Equit’Area website or on CRAN, and the installation is standard.
Because the package is also intended for novice R users, the example data are not included as an R dataset but as a text file, so that the package’s manual can show how to import a file.
SesIndexCreatoR is composed of three main functions and several visualizing and internal functions (see Table 3):
The SesIndex function creates a socioeconomic index as defined in the Equit’Area project. It is possible to choose the starting set of variables, the potential redundant groups of variables, the potential supplementary units, the method of selection (PCA or MFA) and the step of the procedure to perform. Results include the final index and all the results of the intermediate steps.
The SesClassif function creates socioeconomic categories, based on a socioeconomic index created by the SesIndex function, with different techniques such as hierarchical clustering, quantiles or equal subdivisions. Results include both a table containing the original data set with the class of each unit and the results of the classification technique (cut points, class particularities, ...).
The SesReport function creates a .html file with a report summarizing the results of the different steps of the creation of a socioeconomic index with the SesIndex function and, if any, the classification of the index using the SesClassif function. This function can also create a .csv file containing the original data set, the index and, if any, the classification.
4. Example
First, the socioeconomic data from the text file are imported in a data frame:
R> library("SesIndexCreatoR")
R> SesData <- read.table(
+ system.file("extdata", "SesData.txt", package = "SesIndexCreatoR"),
+ header = TRUE, sep = "\t", row.names = 1)
The SesData.txt file contains 37 socioeconomic variables and 1 type variable (giving the type of BG) for each BG of the Lille municipality and adjacent municipalities, as described in Section 2.1. The SesData data frame therefore has 234 rows representing the BGs and 38 columns representing the variables.
Table 3. Functions available in SesIndexCreatoR 1.0-1.
As the SesIndex function needs vectors or lists of variable names as arguments, we then extract the different vectors and lists needed to call the function (with redundant groups). The first line of the following code chunk extracts the names of the variables to analyse as a vector. The remaining lines extract the names of the variables in the two groups of redundant variables (see Table 1) and create a list containing the two vectors of names for the groups of redundant variables.
R> varnames <- colnames(SesData)[2:ncol(SesData)]
R> group1 <- grep("+Unemployed", colnames(SesData), value = TRUE)
R> group2 <- grep("+LabourForce", colnames(SesData), value = TRUE)
R> groupvarnames <- list(group1, group2)
In order to consider activity and miscellaneous BGs as illustrative units, we extract the names of the corresponding rows (in our example, A is for “Activity” and D for “Miscellaneous” types of BGs):
R> illus <- rownames(SesData[SesData[, "Type"] %in% c("A", "D"), ])
It is now possible to create the socioeconomic index described in Section 2 (Material and Methods) using SesIndex. Here, we create a socioeconomic index using all three steps. Two groups of redundant variables are defined in groupvarnames, and several BGs are set as illustrative. By default, all three steps are performed and step 2 uses a PCA.
R> index <- SesIndex(SesData, varnames = varnames, groupvarnames = groupvarnames,
+ sup = illus)
R> plot(index, choice = "ind", label = "none")
Once the index is created, we want to explore the results of the procedure. For instance, among the groups of redundant variables listed in Table 1 (Unemployment and Labor Force), the variables that best represent these groups, and are therefore selected by our procedure, are:
R> index$step1$selection
[1] "UnemployedTotal" "LabourForce"
Or, among the list of variables in Table 1 and Table 2 (except the redundant variables dropped at step 1), the variables selected to compose the SES index for Lille agglomeration are:
R> index$step2$selection
 [1] "ForeignPop"            "UnemployedTotal"
 [3] "InsecureJobs"          "SteadyJobs"
 [5] "SingleParentFamilies"  "NoDiplomas"
 [7] "IndividualHouse"       "MultipleDwellingUnits"
 [9] "ParkingSpace"          "NonOwner"
[11] "WithoutCar"            "TwoOrMoreCars"
[13] "SubsidizedHousing"     "MedianIncome"
R> plot(index, choice = "var", step = 2)
It is also possible to obtain detailed results of the data mining techniques, such as the correlation coefficients of the variables with the first two components of the second step analysis:
R> index$step2$analysis$var$coord[,c(1,2)]
                                 Dim.1       Dim.2
UnderAge25                  0.63630110  0.21862821
OverAge65                  -0.43163887 -0.18186596
ForeignPop                  0.77678724 -0.15910609
LabourForce                -0.21536615  0.11435159
UnemployedTotal             0.87073008 -0.29624249
SelfEmployed               -0.54334383  0.47689134
InsecureJobs                0.89130598  0.13786739
SteadyJobs                 -0.87777733  0.03703831
SingleParentFamilies        0.78556166 -0.19555464
NoDiplomas                  0.64635948 -0.67461860
HouseholderAlone            0.45573846  0.72622501
AttendingSchool            -0.37025379 -0.02709806
BasicGeneralQualifications -0.15542464 -0.87073704
GeneralCertificates        -0.62380039  0.12642215
LowerTertiaryEducation     -0.59037219  0.55626000
HigherEducationalDegree    -0.36942886  0.79823028
Students                    0.35951016  0.68000017
IndividualHouse            -0.73553777 -0.46219873
MultipleDwellingUnits       0.74861629  0.43400686
BuiltBefore1968             0.09240263 -0.19191144
Builtafter1990             -0.05250108  0.60279988
ParkingSpace               -0.77646906  0.09453152
NonOwner                    0.83998699  0.24679532
Less40 m2                   0.51827452  0.69853014
Larger150 m2               -0.43976155  0.19532641
WithoutCar                  0.90096827  0.17029037
TwoOrMoreCars              -0.87800029 -0.10183450
SubsidizedHousing           0.71268195 -0.27645939
MedianIncome               -0.82471346  0.17342006
Or the proportion of variance explained by the first four components of the final step:
R> index$step3$analysis$eig[1:4,]
eigenvalue percentage of variance
comp 1 9.4233115 67.309368
comp 2 1.6390996 11.707854
comp 3 1.0678887 7.627776
comp 4 0.5014364 3.581689
cumulative percentage of variance
comp 1 67.30937
comp 2 79.01722
comp 3 86.64500
comp 4 90.22669
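As a quick arithmetic check of this table, and assuming the final PCA is run on standardized variables (so that the total variance equals the number of variables, here 14), each percentage is the eigenvalue divided by 14. The eigenvalues below are copied from the output above.

```r
# Eigenvalues of the first four components, copied from the output above
eig <- c(9.4233115, 1.6390996, 1.0678887, 0.5014364)
p <- 14  # number of variables in the final PCA

# Percentage of variance: eigenvalue over total variance (p for scaled data)
pct <- 100 * eig / p
print(round(pct, 4))

# Cumulative percentage of variance
print(round(cumsum(pct), 4))
```

The values obtained this way agree with the "percentage of variance" and "cumulative percentage of variance" columns, which supports the standardized-PCA reading.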
The above outputs are especially useful for understanding the variable selection procedure. These results show that the total unemployment and total labor force variables were selected from the groups of redundant unemployment variables and labor force variables, respectively. Consequently, only these two variables represent their groups in the subsequent steps.
The step 2 selection shows that only the variables with the highest correlations with the first component were selected. Here, 14 variables out of 29 were kept for the final step and the construction of the SES index. Finally, the first component of the final step PCA performed on these 14 variables explained more than 67% of the total variance.
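The step 2 selection rule can be retraced by hand from the Dim.1 column printed earlier. The sketch below assumes that, as in a standardized PCA, the contribution of a variable to a component is proportional to its squared coordinate on that component; the values are copied from the step 2 output, and the code is an illustration rather than the package's internal implementation.

```r
# Dim.1 coordinates copied from the step 2 output
dim1 <- c(
  UnderAge25 = 0.63630110, OverAge65 = -0.43163887,
  ForeignPop = 0.77678724, LabourForce = -0.21536615,
  UnemployedTotal = 0.87073008, SelfEmployed = -0.54334383,
  InsecureJobs = 0.89130598, SteadyJobs = -0.87777733,
  SingleParentFamilies = 0.78556166, NoDiplomas = 0.64635948,
  HouseholderAlone = 0.45573846, AttendingSchool = -0.37025379,
  BasicGeneralQualifications = -0.15542464,
  GeneralCertificates = -0.62380039,
  LowerTertiaryEducation = -0.59037219,
  HigherEducationalDegree = -0.36942886, Students = 0.35951016,
  IndividualHouse = -0.73553777, MultipleDwellingUnits = 0.74861629,
  BuiltBefore1968 = 0.09240263, Builtafter1990 = -0.05250108,
  ParkingSpace = -0.77646906, NonOwner = 0.83998699,
  "Less40 m2" = 0.51827452, "Larger150 m2" = -0.43976155,
  WithoutCar = 0.90096827, TwoOrMoreCars = -0.87800029,
  SubsidizedHousing = 0.71268195, MedianIncome = -0.82471346
)

# Contribution (in %) of each variable to the first component
contrib <- 100 * dim1^2 / sum(dim1^2)

# Average contribution: 100 / number of variables
threshold <- 100 / length(dim1)

# Variables contributing more than the average
selected <- names(contrib)[contrib > threshold]
print(selected)
length(selected)  # 14
```

Under this assumption the rule returns the same 14 variables as index$step2$selection. NoDiplomas falls just above the threshold and UnderAge25 just below it, which shows how sharp the cut-off can be for borderline variables.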
R> plot(index, choice = "var", step = 3)
Some graphical outputs are shown in Figures 1-3. Figure 1 is a synthetic view of the projection of the BGs on the first principal components of the PCAs performed in step 2 and step 3. Black dots represent active units whereas blue circles represent illustrative units (i.e., BGs of the economic activity or miscellaneous types). Due to the number of units, BG labels are not displayed here, although they are displayed by default. The step 3 part of the figure shows that the BGs are spread mainly along the first component and vary much less along the second component.
Figure 1. Synthetic view of the graphical outputs for individuals.
Figure 2. Correlation circle for the second step.
Figure 3. Correlation circle for the final step.
Figure 2 gives the circle of correlations of the PCA performed in step 2. Most of the variables seem to be well correlated with the first component, either positively or negatively, whereas the correlations with the second component are mainly positive (except for two variables). A few variables (5) are not well represented on this plane and may have higher correlations with the third or fourth component. In this figure, a first opposition can be seen between “variables of deprivation”, on the right, and “variables of favor”, on the left.
Finally, Figure 3 shows the circle of correlations of the PCA performed in step 3. The opposition between the “deprivation” and the “favor” variables is clear, with a high positive correlation between the first component and the proportions of non-owners, unemployment, insecure jobs, persons without a diploma, subsidized housing, ... and a high negative correlation between the first component and the proportions of steady jobs, individual houses, .... The first component of this PCA can therefore be interpreted as a SES component and be used as a SES index.
We now want to create categories from the socioeconomic index. We use a hierarchical clustering followed by a k-nearest neighbor (k-nn) algorithm. We decide to have an automatic number of classes (i.e., to cut the hierarchical clustering tree where the relative loss of inertia is the highest):
R> categories <- SesClassif(index)
Other possibilities currently available in the SesClassif function are to create classes with hierarchical clustering without k-nn consolidation, with quantiles, or with equal ranges of values.
We can summarize some characteristics of the different categories using simple functions. For instance, it is possible to compare the average value of each variable in each category with the overall mean:
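To make the difference between these classification options concrete, here is a minimal base R sketch on a hypothetical one-dimensional index. It does not call SesClassif: the simulated values, the fixed number of classes (3) and the Ward linkage are assumptions made for illustration, and the k-nn consolidation step is not shown.

```r
set.seed(42)
# Hypothetical index values drawn from three well-separated groups
idx <- c(rnorm(60, -3, 0.4), rnorm(80, 0, 0.4), rnorm(40, 3, 0.4))

# Hierarchical clustering on the index, cut into 3 classes
tree <- hclust(dist(idx), method = "ward.D2")
hc_class <- cutree(tree, k = 3)

# Quantile-based classes (terciles): equal number of units per class
q_breaks <- quantile(idx, probs = seq(0, 1, length.out = 4))
q_class <- cut(idx, breaks = q_breaks, include.lowest = TRUE, labels = FALSE)

# Equal-range classes: the range of the index is split into 3 intervals
r_class <- cut(idx, breaks = 3, labels = FALSE)

print(table(hc_class))
print(table(q_class))
print(table(r_class))
```

Quantiles force classes of equal size whatever the shape of the distribution, equal ranges depend only on the extreme values, and clustering follows the gaps in the data. Which of these is preferable depends on how the categories will be used.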
R> for (i in 1:3) {
+ print(paste("Category", i))
+ print(round(categories$analysis$desc.var$quanti[[i]][, c(2, 3, 6)], 2))
+ }
[1] "Category 1"
Mean in category Overall mean p.value
TwoOrMoreCars 0.33 0.21 0
IndividualHouse 0.71 0.45 0
SteadyJobs 0.73 0.65 0
ParkingSpace 0.60 0.43 0
MedianIncome 27529.06 21986.21 0
NoDiplomas 0.11 0.16 0
ForeignPop 0.02 0.05 0
SubsidizedHousing 0.08 0.26 0
UnemployedTotal 0.10 0.16 0
SingleParentFamilies 0.11 0.17 0
MultipleDwellingUnits 0.25 0.52 0
WithoutCar 0.16 0.29 0
InsecureJobs 0.09 0.13 0
NonOwner 0.35 0.58 0
[1] "Category 2"
Mean in category Overall mean p.value
MultipleDwellingUnits 0.66 0.52 0.00
NonOwner 0.69 0.58 0.00
WithoutCar 0.35 0.29 0.00
InsecureJobs 0.14 0.13 0.00
SingleParentFamilies 0.18 0.17 0.02
SteadyJobs 0.63 0.65 0.01
MedianIncome 19693.52 21986.21 0.00
ParkingSpace 0.35 0.43 0.00
IndividualHouse 0.30 0.45 0.00
TwoOrMoreCars 0.14 0.21 0.00
[1] "Category 3"
Mean in category Overall mean p.value
UnemployedTotal 0.33 0.16 0
ForeignPop 0.14 0.05 0
SingleParentFamilies 0.28 0.17 0
SubsidizedHousing 0.74 0.26 0
NoDiplomas 0.30 0.16 0
InsecureJobs 0.18 0.13 0
WithoutCar 0.47 0.29 0
NonOwner 0.90 0.58 0
MultipleDwellingUnits 0.85 0.52 0
IndividualHouse 0.13 0.45 0
TwoOrMoreCars 0.08 0.21 0
ParkingSpace 0.19 0.43 0
MedianIncome 12624.39 21986.21 0
SteadyJobs 0.46 0.65 0
NULL
R> plot(categories$analysis, choice = "map", label = "none", draw.tree = FALSE)
We can see that the optimal number of categories (according to the inertia criterion) is 3. The description of these categories shows that they are ordered by decreasing socioeconomic status. Indeed, category 1 has higher average values for variables such as median income or proportion of steady jobs, and lower average values for the proportion of unemployed people or the proportion of subsidized housing, whereas category 3 has lower values of median income and higher values of unemployment. Figure 4 shows the projection of these categories on the first two axes of the final PCA (note that it is also possible to call plot(categories) directly to obtain both the dendrogram and the projection of the units).
Figure 4. Plot of the individuals by categories.
Finally, we want to export the detailed results of all three steps of the creation of the SES index and of the classification. We also want to export a data file containing the index and the categories. To do so, the SesReport function is used to create a .html report (see Appendix). By default, files are created in the current working directory with the basename “SesReport” (which can be changed through the arguments of the SesReport function).
R> SesReport(categories)
5. Conclusions
In this article we presented the SesIndexCreatoR package, designed to easily create socioeconomic indices with a reproducible statistical procedure. One original feature of this procedure compared to other existing indices lies in selecting the final variables for the index with data mining techniques rather than only from a literature review, which removes part of the subjectivity that may influence the choice of variables. This data-driven approach lets the data “speak for themselves”.
The SesIndexCreatoR package allows this procedure to be applied in a versatile way, by specifying which steps of the procedure should be run (for instance, only step 2 if the aim is to compare the selection of variables between metropolitan areas without creating indices, or only step 3 if one wants all the introduced variables to be in the index), adding illustrative units or selecting the method used. Once the index is created, several tools are available to visualize, summarize, explore and export the results in a convenient way for further use.
We plan to extend the package in the future. Among other improvements, we foresee implementing other methods of classification, adding more tools to help with the interpretation of the results, and allowing other ways of visualization (such as mapping). However, these improvements will be made according to users’ feedback and needs.