The main objective of this study is to analyze work-related travel behavior through a set of variables describing socio-economic class, urban environment and travel characteristics. Principal Component Analysis was applied to a sample of workers of the São Paulo Metropolitan Area, drawn from the origin-destination home interview survey carried out in 1997, in order to: 1) examine the interdependence between travel patterns and a set of socioeconomic and urban environment variables; 2) determine whether the original database can be synthesized into components. The results reveal relations between the individual’s socio-economic class and car usage, between characteristics of the urban environment and destination choices, and between age and non-motorized travel mode choice. It is concluded that the database can be adequately summarized in three components for subsequent analysis: 1) urban environment; 2) socio-economic class; and 3) family structure.

Personal travel behavior depends largely on two groups of variables: socioeconomic characteristics and urban environment factors (residential density, proximity of localities, spatial coverage of the transportation network, etc.).

The influence of individual socioeconomic and household characteristics on choosing travel patterns has been studied over the years [

Urban environmental factors such as land use and the distribution of road network infrastructure are among the variables that individuals take into account when making their travel decisions (travel mode, destination, route, etc.). Urban density, city shape, and the degree of dispersion of activities in the urban environment, for example, are strongly related to modal choice.

The main objective is to analyze, through exploratory techniques, individual work-travel behavior using a set of variables related to socioeconomic characteristics and to the urban environment (distribution and intensity of opportunities).

The Principal Component Analysis (PCA) was applied to a sample of workers of the São Paulo Metropolitan Area (SPMA) in order to: 1) examine patterns or relations a priori unknown between travels and a large set of variables; 2) determine if the original database can be synthesized in a set of components.

This paper analyzes the interdependence between a set of variables and work-related travel behavior. A sequence of steps was performed to examine three main hypotheses: 1) if characteristics related to the individual’s socio-economic class affect the modal choice and travel distances; 2) if variables related to the family structure affect the travel behavior; 3) if distribution and intensity of opportunities in the urban environment have effect on destination choice decisions.

This paper is organized into sections and subsections according to the methodological framework presented in

In general, socioeconomic characteristics are strongly related to human behavior. Some attributes (such as income) provide an appropriate base for population segmentation and comprehension of individual behavior, particularly travel behavior.

Travel pattern choices are strongly related to the individual’s socio-economic class, traditional division of household roles and family structure [

A broad range of household characteristics influence the complexity of trip chains including age, income, gender. Like household characteristics, travel patterns such as origin, destination, purpose, characteristics of transportation system and the number of vehicles per household also influence trip chain behavior [

Hanson and Hanson [

Mitchell and Town [

Kitamura [

Dargay [

Household structure and family size can also influence travel patterns. Travel is part of a structure of household activities. The division of household tasks, influenced mostly by the individual’s family role (family head, spouse, etc.), is fundamental, considering the differences found in the travel patterns of family members. Women classified as neither spouse nor family head account for a higher number of work-commuters. On the other hand, women classified as spouse or family head show lower rates of work-commuting [

Lee et al. [

Considering the relation between urban density and travel behavior, Newman and Kenworthy [

Handy [

This study is based on the origin-destination home interview survey carried out by METRO-SP in the São Paulo Metropolitan Area (SPMA) in 1997. At that time, the SPMA had a population of approximately 17 million, distributed across 39 counties and 389 traffic zones. The survey collected socioeconomic and travel characteristics data. The original SPMA database comprises 98,780 individuals.

The data processing phase included the following steps: 1) removal of incomplete data; 2) omission of individuals who do not work; 3) removal of individuals who do not work in the industrial, commercial or service sectors; 4) omission of individuals who did not travel on the day before the interview survey. The result was a sample of 24,335 workers in the industrial, commercial and service sectors.

Principal Components Analysis (PCA) is an exploratory multivariate data analysis technique whose main objective is to detect the structure of a large set of variables (data patterns and relations) and to reduce multidimensional data sets to lower dimensions.

Starting from the dependence structure between the variables in question, PCA enables the creation of a smaller set of variables (factors or components) derived from the original variables. Additionally, it is possible to know to what extent each component is associated with each variable and how much of the variability of the original data the set of components explains.

PCA is an interdependence technique that analyzes the mutual association between all numeric variables. Its use is mainly recommended in studies of multiple variables with high inter-correlation. It can be applied to solve multicollinearity problems, since each factor is a combination of variables and the factors are mutually orthogonal [

One of the main advantages of PCA is that it makes no assumption of normality for the variables involved. The components are obtained from a decomposition of the correlation matrix. The factorial loads result from this decomposition and indicate how strongly each variable is associated with each of the factors involved.

The eigenvalues are numbers that reflect the importance of the factor. When the number of factors is the same as the number of variables, the sum of eigenvalues corresponds to the sum of the variance of these variables. The factorial loads are parameters which express the covariance between each factor and the original variables. When standardized variables are used (correlation matrix), these values correspond to the correlation between the factor and the original variables.
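The two properties stated above can be checked numerically. The following is a minimal sketch using NumPy on synthetic data (the data, the seed and all variable names are assumptions made for illustration, not part of the study): it decomposes a correlation matrix, confirms that the eigenvalues sum to the number of standardized variables, and confirms that each factorial load equals the correlation between a component and an original variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 individuals, 4 numeric variables forming two correlated pairs.
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.5 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.5 * rng.normal(size=500),
])

# Standardize the variables, then decompose their correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)

# eigh returns eigenvalues in ascending order; put the largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Factorial loads: eigenvectors scaled by the square root of their eigenvalue.
loads = eigenvectors * np.sqrt(eigenvalues)

# Value of every component for every individual.
scores = Z @ eigenvectors

# Correlation between each original variable (rows) and each component (columns).
corr_var_comp = np.array([
    [np.corrcoef(Z[:, i], scores[:, j])[0, 1] for j in range(4)]
    for i in range(4)
])

print(eigenvalues.sum())                   # total variance of the standardized variables
print(np.allclose(corr_var_comp, loads))   # loads coincide with the correlations
```

Because the variables are standardized, each contributes a variance of 1, which is why the eigenvalues add up to the number of variables.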

Hence, PCA can: 1) identify the structure or relations between variables by analyzing the correlation between them (Analysis 1 of this paper); 2) create a new and smaller set of variables (factors/components) that can partially or totally replace the original one in subsequent techniques (Analysis 2 of this paper). There are four steps to using PCA, described as follows.

1) Calculate the correlation matrix of the variables under study, to verify the level of association of the variables among themselves;

2) Extract the most significant components, which explain as much information as possible (maximum data variability); eigenvalues should usually be equal to or greater than 1;

3) Interpret and name each factor, observing the contribution (factorial load) of all the variables;

4) Generate the value of each component for each individual (the factor scores), to be used in other analyses.
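The four steps can be sketched as a single function. This is an illustrative outline in NumPy, not the procedure actually run in the study (which relied on statistical software); the function name `pca_latent_root`, the threshold parameter and the synthetic data are all assumptions.

```python
import numpy as np

def pca_latent_root(X, threshold=1.0):
    """Hypothetical helper: PCA on the correlation matrix with the eigenvalue >= 1 rule."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 1: correlation matrix of the variables.
    R = np.corrcoef(Z, rowvar=False)

    # Step 2: extract components and keep those whose eigenvalue reaches the threshold.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    keep = eigenvalues >= threshold
    explained = eigenvalues[keep].sum() / eigenvalues.sum()

    # Step 3: factorial loads, used to interpret and name each component.
    loads = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

    # Step 4: value of each retained component for each individual.
    scores = Z @ eigenvectors[:, keep]
    return eigenvalues[keep], explained, loads, scores

# Synthetic example: six variables driven by two independent latent signals.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
X = np.column_stack(
    [latent[:, 0] + 0.3 * rng.normal(size=300) for _ in range(3)]
    + [latent[:, 1] + 0.3 * rng.normal(size=300) for _ in range(3)]
)
kept, explained, loads, scores = pca_latent_root(X)
print(len(kept), round(explained, 2))
```

In this toy data two components pass the threshold, one per latent signal, and the loads show which three variables belong to each.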

At this stage, the variables that are interrelated with the travel patterns were selected. The variables related to the socioeconomic characteristics were chosen through current literature and data availability. As mentioned before, PCA allows the use only of numeric variables. Thus, through the original data base the variables were selected and adapted as shown in

In this paper, the variables related to the urban environment represent the distribution and intensity of opportunities in the SPMA. To prepare these variables, the premise of the intervening opportunities model was adopted: in an urban area all trips are as short as possible, and are only as long as necessary to reach the closest acceptable destination at which the traveler’s goal is satisfied. These variables were represented by the level of “cumulative opportunity” by buffer distance.

Socioeconomic variable | Description | Socioeconomic variable | Description |
---|---|---|---|
FAM | Number of family members | AGE | Age |
CAR | Number of cars in the household | II | Individual income |
FI | Family income (R$) | NCH | Number of children in the household |

What does the level of “cumulative opportunity” mean? In this paper, opportunity refers to job offers (industry, commerce and services). Hence, the term “opportunity” represents a proportion (%) of total employment.

This proportion of jobs was then accumulated by distance buffers, measured from the centroid of the residence zone to the centroids of the zones located within 5 km, 10 km, 15 km and 20 km, thus generating the term “cumulative opportunity”.
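As an illustration of how such a variable could be computed, the sketch below accumulates the share of total jobs found within each buffer distance of every zone centroid. Several details are assumptions not stated in the text: straight-line distances on planar coordinates in kilometres, inclusion of the residence zone itself, and the function name `cumulative_opportunity`. The four toy zones are invented.

```python
import numpy as np

def cumulative_opportunity(centroids_km, jobs, buffers_km=(5, 10, 15, 20)):
    """Share (%) of all jobs located within each buffer distance of every zone centroid.

    centroids_km: (n_zones, 2) planar coordinates in km (assumption)
    jobs:         (n_zones,) number of jobs in each zone
    """
    centroids_km = np.asarray(centroids_km, dtype=float)
    jobs = np.asarray(jobs, dtype=float)
    share = 100.0 * jobs / jobs.sum()

    # Pairwise straight-line distances between zone centroids.
    diff = centroids_km[:, None, :] - centroids_km[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # For each buffer, add up the job shares of all zones inside it.
    return np.column_stack([(dist <= b).astype(float) @ share for b in buffers_km])

# Toy example: four zones along a line, with 10%, 30%, 50% and 10% of the jobs.
centroids = [(0, 0), (4, 0), (12, 0), (30, 0)]
jobs = [100, 300, 500, 100]
result = cumulative_opportunity(centroids, jobs)
print(result[0])  # cumulative opportunity until 5, 10, 15 and 20 km for the first zone
```

Each row is non-decreasing from left to right, which is the reason the four urban environment variables are expected to be strongly correlated with one another.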

The input variables related to travel behavior took into account the main trip performed by each individual on the day before the home interview survey. The main trip is the one for which the individual remained longest at the destination. Most of the main trips were work-related.

Therefore, main trips were characterized by their purpose, travel mode, and distance traveled. In this paper, three travel purposes were considered: 1) WORK: work in industry, commerce, or services; 2) SCH: school; and 3) ACT: other activities. For travel modes, three different categories were analyzed: 1) CAR: private motorized travel mode; 2) PUB: transit; 3) NM: non-motorized travel mode.

Finally, the traveled distances were clustered in 4 groups: 1) 5 km: less than 5 km; 2) 5 to 10 km: between 5 and 10 km; 3) 10 to 15 km: between 10 and 15 km; and 4) 15 km: more than 15 km. By grouping the three travel attributes, each category indicates the purpose, the travel mode, and the travel distance simultaneously. Table 2 shows all the categories considered in this paper.
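A full cross of three purposes, three modes and four distance groups would give 36 categories, whereas the paper works with 27. The short sketch below reproduces that count under the assumption, read from the category table, that non-motorized trips appear only in the group under 5 km; the list names are illustrative.

```python
from itertools import product

purposes = ["WORK", "SCH", "ACT"]
motorized_modes = ["CAR", "PUB"]
distance_groups = ["5 km", "5 to 10 km", "10 to 15 km", "15 km"]

# Motorized modes are combined with every purpose and every distance group.
categories = [
    f"{purpose}-{mode}-{distance}"
    for mode, purpose, distance in product(motorized_modes, purposes, distance_groups)
]

# Assumption taken from the category table: non-motorized trips only under 5 km.
categories += [f"{purpose}-NM-5 km" for purpose in purposes]

print(len(categories))
print(categories[0], "|", categories[-1])
```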

However, one question remains: how can each category be measured numerically, given that PCA accepts only numeric variables? To this end, the CART algorithm (Classification and Regression Tree) was applied to calculate the probability of each worker performing each of the 27 travel categories. The next subsection describes the CART algorithm, its use in this paper, and the resulting conversion of a categorical variable into numeric variables.

Use of CART: Auxiliary Tool

In this paper, a variant of the CART algorithm contained in the SPSS 19.0 software was used as an auxiliary tool. CART establishes a relationship between independent and dependent variables. The algorithm is fitted by successive binary splits of the data set, so that the resulting subsets are increasingly homogeneous with respect to the dependent variable. These divisions are represented by a binary tree structure, in which each node corresponds to a split [

CART is an exploratory technique and can be defined as an acyclic graph that satisfies the following properties: 1) the hierarchy is called a tree and each segment is known as a node; 2) there is one node, called the root node, which contains the complete database; 3) the root node is divided sequentially, generating child nodes; 4) there is only one path between the root node and each other node; 5) when no further data subdivision is possible, the final subgroups are considered terminal nodes or leaves; 6) to build the tree, three main elements must be determined: a set of questions delimiting the data divisions, a criterion for evaluating the best division, and a rule for ending the subdivisions (stop-splitting rule).

In this work, CART was used to calculate the likelihood of each individual in the sample undertaking each of the 27 travel categories represented in

1) Individual A is observed; 2) applying the CART algorithm, individual A is classified in node 8; 3) the individuals that compose leaf 8 have, for example, a 31.31% probability of performing the category WORK-PUB-15 km; 4) this probability (0.3131) becomes the value of the variable “WORK-PUB-15 km” for individual A. Thus, each of the 27 categories is associated with a numeric value corresponding to the probability that an individual will perform the specified travel category. The following values are also observed for individual A: WORK-PUB-5 to 10 km (0.2197); WORK-PUB-10 to 15 km (0.1644).

For individual B, for example, the variable WORK-PUB-15 km assumes the value 0.1175. The variable WORK-CAR-5 km assumes the value 0.1235, whereas the value of the variable WORK-NM-5 km is 0.1096.
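The conversion from a categorical outcome to numeric variables can be sketched as follows. The example assumes the tree has already been fitted and has assigned each individual to a leaf; it only shows how the share of each travel category within a leaf becomes the numeric value attributed to every individual in that leaf. The leaf numbers, the observations and the function name `leaf_probabilities` are invented for illustration and do not reproduce the SPSS procedure.

```python
from collections import Counter, defaultdict

def leaf_probabilities(leaf_ids, observed_categories):
    """Share of each travel category among the individuals classified in each leaf."""
    counts = defaultdict(Counter)
    for leaf, category in zip(leaf_ids, observed_categories):
        counts[leaf][category] += 1
    return {
        leaf: {category: n / sum(counter.values()) for category, n in counter.items()}
        for leaf, counter in counts.items()
    }

# Hypothetical output of an already fitted tree: one leaf id per individual,
# together with the travel category each individual actually performed.
leaf_ids = [8, 8, 8, 8, 3, 3]
observed = [
    "WORK-PUB-15 km", "WORK-PUB-5 to 10 km", "WORK-PUB-15 km", "WORK-CAR-5 km",
    "WORK-NM-5 km", "WORK-NM-5 km",
]
probabilities = leaf_probabilities(leaf_ids, observed)

# Numeric value of the variable "WORK-PUB-15 km" for any individual in leaf 8.
value_for_leaf_8 = probabilities[8].get("WORK-PUB-15 km", 0.0)
print(value_for_leaf_8)
```

Categories that never occur in a leaf receive the value 0 through the `get` default, so every individual ends up with one number per category.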

In order to examine patterns or relations a priori unknown between travel and a large set of variables, the first analysis with PCA application was performed, applying a total of 32 numeric variables.

No. | Category | Purpose | Travel mode | Distance |
---|---|---|---|---|
1 | WORK-CAR-5 km | Work | Car | Less than 5 km |
2 | WORK-CAR-5 to 10 km | Work | Car | 5 to 10 km |
3 | WORK-CAR-10 to 15 km | Work | Car | 10 to 15 km |
4 | WORK-CAR-15 km | Work | Car | More than 15 km |
5 | SCH-CAR-5 km | School | Car | Less than 5 km |
6 | SCH-CAR-5 to 10 km | School | Car | 5 to 10 km |
7 | SCH-CAR-10 to 15 km | School | Car | 10 to 15 km |
8 | SCH-CAR-15 km | School | Car | More than 15 km |
9 | ACT-CAR-5 km | Activity | Car | Less than 5 km |
10 | ACT-CAR-5 to 10 km | Activity | Car | 5 to 10 km |
11 | ACT-CAR-10 to 15 km | Activity | Car | 10 to 15 km |
12 | ACT-CAR-15 km | Activity | Car | More than 15 km |
13 | WORK-PUB-5 km | Work | Transit | Less than 5 km |
14 | WORK-PUB-5 to 10 km | Work | Transit | 5 to 10 km |
15 | WORK-PUB-10 to 15 km | Work | Transit | 10 to 15 km |
16 | WORK-PUB-15 km | Work | Transit | More than 15 km |
17 | SCH-PUB-5 km | School | Transit | Less than 5 km |
18 | SCH-PUB-5 to 10 km | School | Transit | 5 to 10 km |
19 | SCH-PUB-10 to 15 km | School | Transit | 10 to 15 km |
20 | SCH-PUB-15 km | School | Transit | More than 15 km |
21 | ACT-PUB-5 km | Activity | Transit | Less than 5 km |
22 | ACT-PUB-5 to 10 km | Activity | Transit | 5 to 10 km |
23 | ACT-PUB-10 to 15 km | Activity | Transit | 10 to 15 km |
24 | ACT-PUB-15 km | Activity | Transit | More than 15 km |
25 | WORK-NM-5 km | Work | Non-motorized | Less than 5 km |
26 | SCH-NM-5 km | School | Non-motorized | Less than 5 km |
27 | ACT-NM-5 km | Activity | Non-motorized | Less than 5 km |

A) six socioeconomic variables (FAM: number of individuals in the family; CAR: number of cars per household; FI: family income; AGE: age; II: individual income; NCH: number of children per household); B) four urban environment variables (until 5 km: cumulative opportunities until 5 km; until 10 km: cumulative opportunities until 10 km; until 15 km: cumulative opportunities until 15 km; until 20 km: cumulative opportunities until 20 km); C) twenty-two variables related to travel patterns (likelihood of performing 22 of the 27 categories described in

As expected, most variables are intercorrelated, hence there is redundant information in the original database. Summarizing the database into a smaller number of components therefore makes sense, which justifies the PCA application.

The highest correlation values were found among the variables related to the urban environment (this was expected, as these variables are cumulative). The same variables are inversely correlated with the categories ACT-PUB-15 km and WORK-PUB-15 km. This inverse correlation makes sense: where the neighborhood offers many opportunities, individuals are less likely to make long trips (15 kilometers or more), mainly to work.

Family income and car ownership are also highly correlated variables. They are also correlated to travel categories related to car usage. The correlation matrix gives an idea of the general structure of the data and of the existing relations.

The initial factor extraction criterion was an eigenvalue greater than one (latent root criterion). Seven components were extracted, explaining 83% of the variability of the original data. The 32 initial variables can therefore be summarized by these seven components.

To properly interpret and also name the factors or components, it is necessary to analyze the factorial load values of the variables. The factorial loads are the contribution (negative or positive) analyzed for each of the variables related to each of the seven components.

Component 1: individual’s socio-economic class

The first component is the most important, as it explains about 32% of the total variability of the data. High and positive factorial loads were observed for the variables CAR (number of cars in the household) and FI (family income).

How does socio-economic class relate to travel behavior? As seen in the literature review, higher-income individuals use the car more often. Accordingly, high and positive factorial loads are seen for the likelihood of the travel categories involving car usage, regardless of travel purpose and distance traveled. Negative loads were also observed for work trips made by public transportation. It can be stated that Component 1 represents the individual’s socio-economic class, which is strongly associated with travel mode choice, especially car usage.

Component 2: urban environment

The second component explains about 18% of data variability and mainly characterizes the urban environment. High and positive loads were found for the four variables related to the urban environment (cumulative opportunities by buffer distance).

How is the urban environment factor associated with travel behavior? The prevailing view in the literature is that compact cities with well-distributed mixed activities (high density) favor non-motorized travel modes and shorter trips. The factorial loads of component 2 are high and negative for the categories with distances greater than 15 km; in other words, the greater the number of cumulative opportunities, the lower the probability of individuals travelling more than 15 km to satisfy their travel purpose.

Component 3: age vs. non-motorized travel mode

Component 3 explains about 9% of the data variability and has a high and positive factorial load for the age variable. In contrast, high and negative loads were observed for the categories combining non-motorized travel mode and travel distance under 5 km. Intuitively, older individuals are expected to face more restrictions when travelling by non-motorized modes, mainly when distances are longer than 1 km, for example.

Component 4: school

Component 4 is associated with school travel categories. The variables related to these activities (mainly study) are the ones with high loads on this component.

Component 5: work

Component 5 is associated with work-related trips. No high contribution from socioeconomic or urban environment variables was observed. As with component 4, the high loads on this component belong to the variables related to work activities.

Variable | Comp. 1 | Comp. 2 | Comp. 3 | Comp. 4 | Comp. 5 | Comp. 6 | Comp. 7 |
---|---|---|---|---|---|---|---|
Until 5 km | 0.42 | 0.70 | 0.00 | 0.37 | 0.27 | 0.05 | 0.08 |
Until 10 km | 0.41 | 0.75 | 0.04 | 0.33 | 0.33 | 0.03 | 0.08 |
Until 15 km | 0.41 | 0.74 | 0.06 | 0.30 | 0.36 | 0.02 | 0.08 |
Until 20 km | 0.38 | 0.74 | 0.07 | 0.26 | 0.37 | 0.00 | 0.08 |
FAM | −0.24 | −0.02 | −0.08 | −0.12 | −0.13 | −0.02 | 0.71 |
CAR | 0.78 | −0.19 | −0.04 | −0.13 | 0.23 | 0.14 | 0.01 |
FI | 0.68 | 0.01 | 0.04 | −0.18 | 0.05 | 0.38 | 0.06 |
AGE | 0.33 | 0.05 | 0.65 | 0.12 | −0.24 | −0.07 | −0.32 |
II | 0.48 | 0.02 | 0.23 | −0.08 | −0.19 | 0.37 | −0.02 |
NCH | −0.17 | −0.06 | 0.06 | −0.01 | −0.24 | −0.12 | 0.72 |
ACT-CAR-5 km | 0.92 | −0.17 | 0.02 | 0.04 | 0.05 | 0.10 | 0.09 |
ACT-CAR-5 to 10 km | 0.27 | −0.54 | 0.14 | 0.33 | −0.02 | −0.66 | −0.02 |
ACT-CAR-10 to 15 km | 0.49 | −0.49 | −0.05 | 0.21 | 0.19 | −0.63 | 0.06 |
ACT-15 km | 0.69 | −0.24 | 0.13 | −0.02 | −0.06 | 0.49 | 0.11 |
ACT-PUB-5 km | −0.79 | 0.29 | −0.04 | 0.21 | −0.15 | 0.09 | −0.02 |
ACT-PUB-5 to 10 km | −0.34 | 0.72 | 0.15 | 0.23 | 0.44 | −0.04 | −0.01 |
ACT-PUB-10 to 15 km | 0.01 | 0.38 | 0.41 | −0.37 | −0.01 | −0.28 | 0.00 |
ACT-PUB-15 km | −0.58 | −0.58 | −0.01 | −0.05 | 0.03 | 0.16 | −0.05 |
ACT-NM-5 km | 0.00 | −0.15 | −0.85 | −0.09 | 0.19 | −0.04 | 0.02 |
SCH-CAR-5 km | 0.95 | −0.12 | 0.08 | 0.60 | −0.11 | −0.23 | 0.01 |
SCH-CAR-15 km | −0.16 | −0.75 | 0.22 | 0.65 | 0.05 | −0.04 | 0.05 |
SCH-PUB-5 km | −0.30 | −0.58 | −0.18 | 0.61 | 0.27 | 0.42 | 0.03 |
SCH-NM-5 km | −0.16 | −0.03 | −0.67 | 0.69 | 0.51 | −0.08 | −0.02 |
WORK-CAR-5 km | 0.95 | −0.14 | −0.05 | −0.07 | 0.02 | −0.23 | 0.01 |
WORK-CAR-5 to 10 km | 0.96 | −0.15 | 0.01 | −0.14 | 0.03 | −0.06 | 0.03 |
WORK-CAR-10 to 15 km | 0.93 | −0.11 | 0.05 | −0.08 | 0.02 | 0.02 | −0.01 |
WORK-CAR-15 km | 0.88 | −0.10 | 0.19 | −0.02 | −0.24 | 0.25 | 0.01 |
WORK-PUB-5 km | −0.60 | 0.47 | 0.34 | −0.02 | 0.65 | −0.06 | 0.05 |
WORK-PUB-5 to 10 km | −0.71 | 0.06 | 0.51 | −0.26 | 0.68 | −0.05 | 0.01 |
WORK-PUB-10 to 15 km | −0.62 | −0.20 | 0.52 | −0.20 | 0.40 | 0.04 | 0.02 |
WORK-PUB-15 km | −0.51 | −0.70 | 0.21 | 0.29 | 0.64 | 0.19 | 0.05 |
WORK-NM-5 km | −0.29 | 0.46 | −0.71 | 0.12 | 0.66 | 0.00 | −0.08 |

Component 6: other activities

Component 6 is exclusively associated with travel pattern categories related to other activities.

Component 7: family structure

Finally, component 7 is characterized by the variables FAM (number of individuals in the family) and NCH (number of children per household). No travel category contributes to it. Because only the main trip is considered, a strong relationship between family structure and travel behavior cannot be expected. Stronger relations might be found for the complexity of trip chaining, for example. The number of children can also be associated with the number of trips per day: the higher the number of children, the higher the number of trips performed per day.

The purpose of the second PCA analysis is to determine whether the original database can be synthesized in a set of factors/components. The resulting dataset can be used in subsequent analyses, especially with techniques that analyze the dependence between variables and assume that the independent variables are not correlated with one another.

In analysis 2, the variables related to travel pattern categories were excluded. Only the socioeconomic and urban environment numerical variables were used, totaling ten variables.

The latent root criterion was also used to extract the components. Three components were obtained, explaining approximately 72% of data variability.
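The 72% figure can be reproduced from the eigenvalues reported in the component table for analysis 2. With ten standardized variables the total variance equals 10, so each component's share is its eigenvalue divided by 10. The snippet below is only this arithmetic; the variable names are illustrative, and small differences from the published percentages (41.15, 19.07, 11.62) come from the eigenvalues being rounded to two decimals.

```python
# Eigenvalues of the three retained components, as reported for analysis 2.
eigenvalues = [4.11, 1.91, 1.16]
n_variables = 10  # standardized variables, so the total variance is 10

# Latent root criterion: keep components whose eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Share of total variance explained by each retained component, in percent.
pct = [100 * ev / n_variables for ev in retained]
cumulative = [sum(pct[: i + 1]) for i in range(len(pct))]

for ev, p, c in zip(retained, pct, cumulative):
    print(f"eigenvalue {ev:.2f}: {p:.1f}% of variance, {c:.1f}% cumulative")
```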

According to the factorial load values, the three components obtained can be termed: 1) component 1: urban environment; 2) component 2: individual socio-economic class; 3) component 3: family structure. The ten variables can thus be synthesized in three components that represent the attributes which supposedly influence travel behavior. Each individual’s values on these three components (the factor scores) can then be used in subsequent analyses, especially with techniques that are sensitive to multicollinearity.
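The reason component values suit techniques that are sensitive to multicollinearity is that they are uncorrelated with one another by construction. The sketch below illustrates this on synthetic, deliberately collinear predictors (the data and all names are assumptions for illustration): the original variables show strong correlations, while the component values show essentially none.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical predictors: three noisy copies of one signal plus one independent variable.
signal = rng.normal(size=400)
X = np.column_stack(
    [signal + 0.4 * rng.normal(size=400) for _ in range(3)]
    + [rng.normal(size=400)]
)

# PCA on the correlation matrix of the standardized variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr_original = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr_original)
scores = Z @ eigenvectors

# Largest absolute correlation between two different columns, before and after.
corr_scores = np.corrcoef(scores, rowvar=False)
max_offdiag_original = np.abs(corr_original - np.eye(4)).max()
max_offdiag_scores = np.abs(corr_scores - np.eye(4)).max()

print(round(max_offdiag_original, 2))
print(max_offdiag_scores < 1e-8)
```

In practice only the retained components would be passed on to the next model, which trades a small loss of information for uncorrelated inputs.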

Through an exploratory analysis, this work seeks to better understand the factors that individuals (mainly workers) consider when choosing travel modes, distances and purposes. The components found adequately synthesize the data set into three factors that represent (also according to the literature reviewed) the attributes that influence travel behavior: 1) socio-economic class; 2) family structure; and 3) urban environment.

Analyzing the data structure (PCA application, analysis 1), in order to examine patterns or relations between travel variables and a large data set, made it possible to assess the main hypotheses of this work.

Hypothesis 1: socio-economic class affects the modal choice and travel distances

In the first PCA analysis, component 1 revealed the relation between socio-economic class (CAR and II) and travel behavior. The higher the income or the number of cars in the household, the higher the probability of choosing the car, and thus the lower the probability of using transit or a non-motorized travel mode.

Hypothesis 2: variables related to the family structure affect travel behavior

Component 7 (analysis 1 of PCA), which represents the family structure, had no apparent relationship with travel behavior. However, the family structure may influence the complexity of trip chaining or the number of trips performed, rather than the main trip, which is what this paper represents.

Hypothesis 3: distribution and intensity of opportunities in the urban environment affect the destination choices.

Through component 2 (analysis 1 of PCA), one can conclude that the higher the number of cumulative opportunities in the residence zone, the lower the probability of individuals making long trips. Work trips, which usually have fixed schedules and locations, can affect long-term choices. Particularly in the São Paulo Metropolitan Area, many individuals choose to live close to their work locations, especially in regions with high numbers of job offers.

Comp | Eigenvalue | % of var | Cumulative % | Until 5 km | Until 10 km | Until 15 km | Until 20 km | FAM | CAR | FI | AGE | II | NCH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4.11 | 41.15 | 41.15 | 0.90 | 0.94 | 0.94 | 0.92 | −0.25 | 0.42 | 0.47 | 0.26 | 0.34 | −0.23 |
2 | 1.91 | 19.07 | 60.22 | −0.24 | −0.29 | −0.29 | −0.30 | −0.28 | 0.68 | 0.71 | 0.33 | 0.62 | −0.24 |
3 | 1.16 | 11.62 | 71.84 | 0.03 | 0.02 | 0.03 | 0.03 | 0.69 | 0.14 | 0.28 | −0.19 | 0.26 | 0.69 |

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). We also acknowledge the São Paulo Metro for providing the data.