
With the deployment of modern infrastructure for public transportation, several studies have analyzed the movement patterns of people using smart card data and have characterized different areas. In this paper, we propose the “movement purpose hypothesis,” which states that each movement arises from two causes: where the person is and what the person wants to do at a given moment. We formulate this hypothesis as a synthesis model in which two network graphs generate a movement network graph. We then develop two novel embedding models to assess the hypothesis and demonstrate that the models obtain a vector representation of a geospatial area from the movement patterns of people in large-scale smart card data. We conducted an experiment using smart card data for a large railroad network in the Kansai region of Japan and obtained a vector representation of each railroad station and each purpose using the developed embedding models. The results show that network embedding methods are suitable for large-scale movement data, and that the developed models perform better than existing embedding methods in the multi-label classification of train stations on the purpose of use dataset. Our proposed models can contribute to the prediction of people flows by discovering underlying representations of geospatial areas from mobility data.

As location-based sensor devices and networks have become widespread, a large amount of user mobility data, which can potentially be used for several research purposes, has been accumulated [

Researchers have used such large amounts of mobility data for location-based recommendation, such as personalized point of interest (POI) [

Modeling and predicting people flow in a specific area leads to an understanding of the characteristics or roles of the area by combining the activity patterns of people with external information about the area [

The basic notion of representation learning [

In this research, we aim to find latent representations of geographical areas using representation learning. Such representations can be used for urban planning and regional development by revealing potential roles of geographical areas and their relations, which cannot always be observed from superficial information in mobility data. We can employ the notion of existing network embedding methods to find such representations from massive people flow data. However, one cannot simply apply existing embedding methods to our problem of embedding geospatial areas. When people move through a large transportation network such as a railroad system, their movement is subject to several geographical constraints. For example, a person who lives in Tokyo does not go to Osaka to shop for daily necessities; such items are bought nearby, and people do not travel far for trivial errands. Therefore, we can assume that people tend to minimize their movements depending on their activities, given the means of transportation available at their current location. We define such geographical constraints as the “movement purpose hypothesis.” If we consider geospatial areas as a network whose links are the movement patterns of people between areas, and then try to embed the network in a low-dimensional vector space to obtain representations of areas, we have to consider such geographical constraints on the movement of people in the real world.

In this paper, we propose a novel embedding method to obtain a vector representation of a geospatial area using the movement patterns of people from large-scale smart card data. Our proposed method consists of two embedding models based on the movement purpose hypothesis: the “concatenating model” and the “internally dividing model.” We conducted an experiment using massive smart card data for a large railroad network in the Kansai region of Japan. We obtained a vector representation of each railroad station using the proposed embedding models and evaluated it in the task of multi-label classification for railroad stations. We demonstrate that our proposed models work well on actual massive mobility data from railroad smart cards. Our proposed method can identify stations in a large railroad network that are geographically dispersed but share similar characteristics or roles in the region. Therefore, we can support city planners, marketers, and policy makers in designing their strategies or implementing their policies for regional development by providing potential characteristics of geographical areas and their relations.

Our contributions in this paper are four-fold:

1) We propose the movement purpose hypothesis and develop novel embedding models to obtain a vector representation of a geospatial area using the movement patterns of people.

2) We demonstrate that our developed models work well using actual large-scale mobility data from smart cards of the railroads in Japan.

3) We also demonstrate that our proposed models can successfully identify stations, which are geographically distributed but share similar characteristics or roles.

4) According to the results of parameter estimation of our proposed embedding model, we find that the purpose of visit for a station is 1.1 times more important than the geographical distance between stations for people movement in a large network of railroads.

Our work is mainly related to mobility data analysis and network-embedding learning. In this section, we discuss our research position and novelty in relation to existing related works.

Recent sensor networks and infrastructures for public transit such as automated fare collection (AFC) systems with smart cards have supported the collection of large volumes of mobility data including people’s activities with detailed time and space information. In particular, mobility data from the AFC systems are currently used for several purposes such as visualization [

Moreover, aiming at several applications for location-based services including a personalized point of interest (POI) recommendation for users [

Recent studies have analyzed movement patterns of people from one area to another using smart card data and have characterized the areas or have enabled segmentation of the areas [

In this research, we aim to find latent representations of geographical areas using representation learning. We can employ the notion of existing network embedding methods to find such representations from massive people flow data. Network embedding derives from graph theory and from word embedding methods in linguistics. In the context of graph theory, adjacency matrix factorization techniques such as singular value decomposition (SVD) and non-negative matrix factorization (NMF) are the prototype [

Network embedding has been further developed for time series analysis [

This section first describes the “movement purpose hypothesis,” which holds that people flow is caused by geolocation and purpose. Next, we explain how the networks are formed from massive people flow data and why label propagation on the network is necessary. We extend the label-propagating network embedding model to massive people flow data. Finally, we propose models based on the hypothesis and explain them in detail.

We propose the “movement purpose hypothesis,” as shown in

There are three graphs that do not mutually share their vectors, the people flow graph (

We interpret this equation in two ways: the “concatenating model” and the “internally dividing model.” For the “concatenating model,” we interpret the operator “+” as joining two vectors into a new vector with twice as many dimensions as each input vector, rather than as element-wise addition of the two vectors (

Based on the concatenating model, vector representations are acquired by the learning algorithm shown in

In this equation,

Step | Operation |
---|---|
1 | Input: |
2 | Output: |
3 | Initialize each vector |
4 | for |
5 | Sample an edge |
6 | Load |
7 | Update |
8 | Overwrite the corresponding part of |
9 | Sample an edge |
10 | Load |
11 | Update |
12 | Overwrite the corresponding part of |
13 | Sample an edge |
14 | Update |
15 | END |

We set this objective function for three networks individually and derive update equations by differentiating them with respect to each vertex vector (
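The sample, load, update, and overwrite pattern of the learning algorithm can be illustrated with a minimal sketch. Because the objective function and symbols are not recoverable from the text here, everything below is an assumption: the function names `pull_together` and `train_concat` are hypothetical, the loss is a plain logistic loss on observed edges only (negative sampling is omitted for brevity), and the layout of the concatenated vector as a geographic half followed by a purpose half is assumed.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pull_together(a: List[float], b: List[float], lr: float) -> Tuple[List[float], List[float]]:
    """One gradient-ascent step on log sigmoid(a . b) for an observed edge."""
    score = sum(x * y for x, y in zip(a, b))
    g = lr * (1.0 - sigmoid(score))
    new_a = [x + g * y for x, y in zip(a, b)]
    new_b = [y + g * x for x, y in zip(a, b)]
    return new_a, new_b


def train_concat(
    stations: Sequence[str],
    purposes: Sequence[str],
    geo_edges: List[Tuple[str, str]],
    purpose_edges: List[Tuple[str, str]],
    flow_edges: List[Tuple[str, str]],
    dim: int = 4,
    steps: int = 200,
    lr: float = 0.05,
    seed: int = 0,
) -> Tuple[Dict[str, List[float]], Dict[str, List[float]]]:
    rng = random.Random(seed)

    def init(n: int) -> List[float]:
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    # Assumed layout of each station vector: [geographic half | purpose half].
    flow = {s: init(2 * dim) for s in stations}
    label = {p: init(dim) for p in purposes}

    for _ in range(steps):
        # Geographic edge: load the first halves, update, write them back.
        u, v = rng.choice(geo_edges)
        flow[u][:dim], flow[v][:dim] = pull_together(flow[u][:dim], flow[v][:dim], lr)
        # Purpose edge (station, purpose label): same for the second half.
        s, p = rng.choice(purpose_edges)
        flow[s][dim:], label[p] = pull_together(flow[s][dim:], label[p], lr)
        # People flow edge: update the full concatenated vectors.
        u, v = rng.choice(flow_edges)
        flow[u], flow[v] = pull_together(flow[u], flow[v], lr)
    return flow, label


# Hypothetical toy data: station and purpose identifiers are placeholders.
flow_vecs, label_vecs = train_concat(
    stations=["S1", "S2", "S3"],
    purposes=["work", "school"],
    geo_edges=[("S1", "S2"), ("S2", "S3")],
    purpose_edges=[("S1", "work"), ("S3", "school")],
    flow_edges=[("S1", "S2"), ("S1", "S3")],
)
```

A practical implementation would add negative sampling and weight edges by their observed frequency; this sketch only shows how partial vectors can be updated and written back into the concatenated representation.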

For the “internally dividing model,” the node vector in the people flow graph (

This equation expresses that people choose their destination by considering both the physical relation between places and the purpose they want to accomplish there.
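The two interpretations of the synthesis can be contrasted in a short sketch. The function names are hypothetical, and the internally dividing form `alpha * geo + (1 - alpha) * purpose` is an assumed reading of the model, including the choice of which vector `alpha` multiplies; the exact equation is not reproduced here.

```python
from typing import List


def concat_vectors(geo: List[float], purpose: List[float]) -> List[float]:
    """Concatenating model: join the two vectors end to end (dimension doubles)."""
    return list(geo) + list(purpose)


def divide_vectors(geo: List[float], purpose: List[float], alpha: float) -> List[float]:
    """Internally dividing model (assumed form): a point between the two vectors."""
    if len(geo) != len(purpose):
        raise ValueError("both vectors must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * g + (1.0 - alpha) * p for g, p in zip(geo, purpose)]


geo = [1.0, 0.0]
purpose = [0.0, 1.0]
flow_concat = concat_vectors(geo, purpose)        # 4 dimensions
flow_divide = divide_vectors(geo, purpose, 0.25)  # 2 dimensions, closer to purpose
```

If α weights the geographical vector as assumed, the reported average α = 0.468 would give (1 − α)/α ≈ 1.14, in line with the roughly 1.1-fold importance of purpose stated in the contributions. This reading is an inference, not a formula quoted from the text.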

We set the objective function as in Equation (2) for each graph. However, when updating the vector

In these equations,

Our proposed models require three networks: the people flow network, the geographical constraints network, and the purpose proximity network. To construct these three networks as input, we use three datasets in the experiment. In this section, we explain these three datasets and how they are prepared.

Getting on and off dataset for the people flow network: This dataset comprises massive smart card data for the Kansai region of Japan (southwestern half of Japan, including Osaka). It contains passengers’ smart card logs provided by six railway companies, and the providers have anonymized it. Each record mainly consists of six elements: the user’s gender and age, the boarding date and time, the alighting date and time, the boarding station, and the destination station. The summary of this dataset is shown in

Train route map dataset for the geographical constraint: We use train network information obtained through the Japan train line API^{1} as geographic proximity information and construct the train route map from it. The graph is undirected, all edges have equal weight, and the route map is shown in
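An undirected, equally weighted route graph of this kind can be represented as an adjacency set per station. The sketch below assumes the route data have already been reduced to pairs of adjacent stations; the function name and the station identifiers are hypothetical, and no call to the train line API is shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_route_graph(adjacent_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected, unweighted graph from pairs of adjacent stations."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in adjacent_pairs:
        if a == b:
            continue  # ignore self-loops
        graph[a].add(b)
        graph[b].add(a)  # undirected: store both directions
    return dict(graph)


# Placeholder station IDs; the duplicate pair has no effect because edges are unweighted.
pairs = [("S1", "S2"), ("S2", "S3"), ("S1", "S2")]
route_graph = build_route_graph(pairs)
```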

Purpose of use dataset for the purpose proximity network: This paper aims to estimate the role of each station. To this end, we produce a dataset from the results of the person trip survey. In Japan, the Ministry of Land, Infrastructure and Transport conducts a nationwide questionnaire survey of many people every decade. We use the 2010 results^{2}, which record how many people come to each station for each purpose. The purposes for getting off at each station are “commuting to work”, “commuting to school”, “going home”, “on business”, and “others”. A summary of this dataset is shown in

Item | Value |
---|---|
Starting date | April 01, 2013 |
Ending date | April 30, 2013 |
Total number of records | 68,763,457 |
Number of unique users | 3,679,251 |
Number of station varieties | 672 |
Number of railway companies | 6 |

Purpose | Number of users |
---|---|
Commuting to work | 1,278,288 |
Commuting to school | 349,234 |
Going home | 2,192,826 |
On business | 358,891 |
Others | 1,313,767 |
Total | 5,493,006 |
Number of stations | 599 |

people flow network, we select only weekday morning data. Therefore, we do not use the “going home” purpose in this dataset and use the remaining four purposes, because we assume that most people do not return home in the morning.
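The weekday morning selection and the aggregation of trips into flow edges can be sketched as follows. The record layout, the function names, the noon cut-off for “morning,” and the use of directed origin-destination counts as edge weights are all assumptions made for illustration; public holidays are not handled.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

MORNING_END_HOUR = 12  # assumption: the text does not define "morning"

Trip = Tuple[datetime, str, str]  # (boarding time, boarding station, destination station)


def is_weekday_morning(ts: datetime) -> bool:
    # weekday() returns 0 for Monday through 6 for Sunday.
    return ts.weekday() < 5 and ts.hour < MORNING_END_HOUR


def build_flow_edges(trips: Iterable[Trip]) -> Counter:
    """Count weekday-morning trips per (origin, destination) pair."""
    edges: Counter = Counter()
    for boarded_at, origin, destination in trips:
        if origin != destination and is_weekday_morning(boarded_at):
            edges[(origin, destination)] += 1
    return edges


# Placeholder records: April 1, 2013 was a Monday and April 6, 2013 a Saturday.
trips = [
    (datetime(2013, 4, 1, 8, 5), "S1", "S2"),    # kept
    (datetime(2013, 4, 1, 8, 40), "S1", "S2"),   # kept
    (datetime(2013, 4, 1, 18, 10), "S2", "S1"),  # evening, dropped
    (datetime(2013, 4, 6, 9, 0), "S1", "S3"),    # weekend, dropped
]
flow_edges = build_flow_edges(trips)
```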

In this section, we evaluate the effectiveness of the developed models on geospatial data. To this end, we compare various algorithms experimentally and report the results below.

We conducted a multi-label classification experiment because passengers get off at a station for multiple purposes; more precisely, the purpose differs from person to person. We therefore regard a station as a probability distribution over purposes and estimate it in the experiment.
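Treating a station as a probability distribution over purposes amounts to normalizing its per-purpose visitor counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
from typing import Dict


def purpose_distribution(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize per-purpose visitor counts of one station into probabilities."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("station has no recorded visitors")
    return {purpose: n / total for purpose, n in counts.items()}


# Illustrative counts only; they are not taken from the survey.
counts = {
    "commuting to work": 60,
    "commuting to school": 20,
    "on business": 10,
    "others": 10,
}
dist = purpose_distribution(counts)
```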

The experimental procedure is as follows. First, we obtain the vector representations using the methods listed in Section 5.2. Second, we train a classifier for each experiment using a labeled training set made from part of the purpose of use dataset. Finally, we predict on test data produced from the rest of the dataset and evaluate the results using several measures.

For multi-label classification, we use a multiclass logistic regression classifier from the LIBLINEAR package^{3}. We use three measures for multi-label classification: the KL divergence, the Mean Reciprocal Rank (MRR), and the Mean Average Precision (MAP). For this experiment, the number of classes is four, as described in Section 4. We evaluate accuracy using two-fold cross-validation repeated five times with random splits. In other words, we use all the getting on and off data for obtaining vector representations, but only half of the stations in the purpose of use dataset for obtaining vector representations and training the classifier. We use the remaining half to evaluate classifier accuracy. We repeat this procedure five times, randomizing the split of the purpose of use dataset each time.
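Two of the three measures can be sketched for purpose distributions as follows. The exact definitions used in the experiment are not given in the text, so the direction of the KL divergence (true distribution against predicted) and the MRR variant (rank of the true most frequent purpose in the predicted ordering) are assumptions; MAP is omitted because its relevance definition is not stated.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) for aligned probability vectors; eps guards against log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def reciprocal_rank(true_dist: Sequence[float], pred_dist: Sequence[float]) -> float:
    """1 / rank of the true top purpose within the predicted ranking."""
    target = max(range(len(true_dist)), key=lambda i: true_dist[i])
    ranking = sorted(range(len(pred_dist)), key=lambda i: pred_dist[i], reverse=True)
    return 1.0 / (ranking.index(target) + 1)


def mean_reciprocal_rank(true_dists: List[Sequence[float]],
                         pred_dists: List[Sequence[float]]) -> float:
    pairs = list(zip(true_dists, pred_dists))
    return sum(reciprocal_rank(t, p) for t, p in pairs) / len(pairs)


# Illustrative distributions over four purposes.
true_d = [0.6, 0.2, 0.1, 0.1]
pred_d = [0.3, 0.4, 0.2, 0.1]
kl = kl_divergence(true_d, pred_d)
mrr = mean_reciprocal_rank([true_d], [pred_d])
```

A lower KL divergence and a higher MRR indicate a better prediction under these definitions.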

Finally, we evaluate the geographical locations of stations around each purpose vector. The evaluation metric is the average standard deviation of the actual geolocations of the stations near each purpose label vector. When this average is large, the group of stations extracted for the purpose is not bound by geographical constraints.
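The geolocation spread around a purpose vector can be computed as sketched below. The neighbourhood size `k`, the use of cosine similarity to pick nearby stations, and the population standard deviation are assumptions, as are the function names and the toy coordinates.

```python
import math
from statistics import pstdev
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def geo_spread(purpose_vec: List[float],
               station_vecs: Dict[str, List[float]],
               station_coords: Dict[str, Tuple[float, float]],
               k: int) -> Tuple[float, float]:
    """Std dev of (longitude, latitude) over the k stations nearest to a purpose vector."""
    nearest = sorted(station_vecs,
                     key=lambda s: cosine(purpose_vec, station_vecs[s]),
                     reverse=True)[:k]
    lons = [station_coords[s][0] for s in nearest]
    lats = [station_coords[s][1] for s in nearest]
    return pstdev(lons), pstdev(lats)


# Placeholder embeddings and (longitude, latitude) pairs in degrees.
station_vecs = {"S1": [1.0, 0.0], "S2": [0.9, 0.1], "S3": [0.0, 1.0]}
station_coords = {"S1": (135.0, 34.0), "S2": (136.0, 35.0), "S3": (135.5, 34.5)}
lon_sd, lat_sd = geo_spread([1.0, 0.0], station_vecs, station_coords, k=2)
```

Averaging these values over all purpose vectors gives one number per method for longitude and one for latitude.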

We use the following methods to compare algorithms.

1) Weighted random: random sampling from a discrete probability distribution. In advance, we calculate the distribution of purposes from the training dataset. When predicting the purpose of getting off at a station in the test data, the method selects a purpose randomly according to this distribution.

2) Word2Vec [

3) GloVe [

4) DeepWalk [

5) LINE [

6) PTE [

7) Proposed: Our proposed models learn geospatial area embeddings from large-scale smart card mobility data. We offer two models based on the “movement purpose hypothesis” described in Section 3.1: the concatenating model (“concat”) and the internally dividing model (“divide”). Our proposed models can embed the vertices of three network graphs in different vector spaces. A single node has a different vector representation in each graph (
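The weighted random baseline in item 1 of the list above can be sketched in a few lines. The function names and the toy counts are hypothetical; the sketch pools purpose counts over all training stations, which is one plausible reading of “calculate the distribution of purposes from the training dataset.”

```python
import random
from typing import Dict, List


def fit_purpose_distribution(train_counts: List[Dict[str, int]]) -> Dict[str, float]:
    """Pool per-station purpose counts and normalize them into one distribution."""
    totals: Dict[str, int] = {}
    for station_counts in train_counts:
        for purpose, n in station_counts.items():
            totals[purpose] = totals.get(purpose, 0) + n
    grand_total = sum(totals.values())
    return {purpose: n / grand_total for purpose, n in totals.items()}


def predict_weighted_random(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one purpose at random, weighted by the training distribution."""
    purposes = list(dist)
    return rng.choices(purposes, weights=[dist[p] for p in purposes], k=1)[0]


# Illustrative training counts for two stations.
train = [{"work": 3, "school": 1}, {"work": 1, "school": 3}]
baseline_dist = fit_purpose_distribution(train)
prediction = predict_weighted_random(baseline_dist, random.Random(0))
```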

Word2Vec and GloVe are word embedding methods and therefore require sentences as input. We regard the sequence of stations in each user’s getting on and off history as a sentence. Word2Vec, GloVe, DeepWalk, and LINE are unsupervised learning methods. Therefore, we use only the users’ getting on and off information at different stations for training. PTE and our proposed models are semi-supervised learning methods. We set the people flow network as the word-word network, the geographical constraints network as the word-document network, and the purpose proximity network as the word-label network. For all methods, the dimension of the node vector is set to 200, but in the proposed concatenating model,
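Turning each user’s getting on and off history into a “sentence” of stations can be sketched as follows. The record layout and function name are hypothetical, and collapsing a destination that is immediately reused as the next origin into a single token is an assumption about the preprocessing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, str, str]  # (user ID, timestamp, boarding station, destination station)


def station_sentences(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Order each user's trips by time and flatten them into one station sequence."""
    per_user: Dict[str, List[str]] = defaultdict(list)
    for user, _, origin, destination in sorted(records, key=lambda r: (r[0], r[1])):
        sequence = per_user[user]
        for station in (origin, destination):
            if not sequence or sequence[-1] != station:  # skip consecutive repeats
                sequence.append(station)
    return dict(per_user)


# Placeholder records, deliberately out of time order.
records = [
    ("u1", 2, "S2", "S3"),
    ("u1", 1, "S1", "S2"),
    ("u2", 1, "S3", "S1"),
]
sentences = station_sentences(records)
```

The resulting lists can be passed as tokenized sentences to any word embedding implementation.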

This section presents the performance and characteristics of our proposed models.

Next, we compare the performance of GloVe with that of the other methods. GloVe gives the best result on the KL divergence metric because only GloVe uses global co-occurrence information in the dataset. The effect of long-context co-occurrence information also appears in the difference between LINE (1st) and LINE (2nd). Whereas LINE (1st) directly approximates the edge weight between two nodes, LINE (2nd) approximates the proximity of nodes that share neighbors within two hops. This effect appears in the KL divergence and MRR results. These results indicate that using the global graph structure benefits multi-label classification.

We compare our proposed models with the PTE (joint) method. Particularly, the proposed model of (divide

Finally, we make a comparison of our proposed models. The proposed (divide

Regarding

Method | KL div. | MRR | MAP |
---|---|---|---|
Weighted random | 40.734e−2 | 45.000e−2 | 74.341e−2 |
Word2Vec | 38.810e−2 | 44.973e−2 | 74.313e−2 |
GloVe | 36.187e−2 | 47.802e−2 | 73.608e−2 |
DeepWalk | 39.496e−2 | 45.192e−2 | 74.176e−2 |
LINE(1st) | 40.006e−2 | 45.000e−2 | 74.341e−2 |
LINE(2nd) | 37.796e−2 | 51.209e−2 | 73.553e−2 |
LINE(concat) | 37.560e−2 | 51.071e−2 | 73.343e−2 |
PTE(joint) | 39.417e−2 | 45.192e−2 | 74.368e−2 |
Proposed (concat | 40.425e−2 | 45.000e−2 | 74.341e−2 |
Proposed (concat | 38.766e−2 | 45.082e−2 | 74.341e−2 |
Proposed (concat | 38.933e−2 | 45.137e−2 | 74.368e−2 |
Proposed (divide | 38.606e−2 | 45.000e−2 | 74.341e−2 |
Proposed (divide | 39.051e−2 | 45.852e−2 | 74.167e−2 |
Proposed (divide | 37.216e−2 | 51.511e−2 | 72.610e−2 |

Variable | Average | Std dev. |
---|---|---|
α | 0.468 | 0.0153 |

Next, we examine the characteristics of the obtained purpose vectors by inspecting the stations around each purpose vector. We aim to extract the purposes of visiting a station accurately, and we hypothesize that the purposes of a station are independent of its geospatial location. If so, our proposed method will gather mutually distant stations around a purpose vector. To evaluate this hypothesis, we examine the standard deviation of station geolocation. The result is presented in

Comparison of the proposed (divide

Finally, we present an illustrative visualization of each method. We present a visualization in

The DeepWalk vectors in (a) form clusters for each railway company. In this people flow dataset, the statistics show that most people usually move within a small area and seldom transfer. Therefore, the DeepWalk visualization result is reasonable because it captures local context information. This result also shows (e) proposed (divide

In (f) proposed (divide

Method | “On business” long SD | “On business” lat SD | “Others” long SD | “Others” lat SD | “To work” long SD | “To work” lat SD | “To school” long SD | “To school” lat SD | Average long SD | Average lat SD |
---|---|---|---|---|---|---|---|---|---|---|
PTE (joint) | 4.052e−2 | 6.884e−2 | 6.283e−2 | 8.801e−2 | 4.991e−2 | 8.412e−2 | 12.866e−2 | 9.020e−2 | 7.048e−2 | 8.280e−2 |
Proposed (concat | 6.516e−2 | 8.739e−2 | 4.804e−2 | 7.941e−2 | 5.530e−2 | 8.124e−2 | 6.399e−2 | 7.714e−2 | 5.812e−2 | 8.129e−2 |
Proposed (concat | 6.126e−2 | 7.224e−2 | 7.586e−2 | 7.782e−2 | 6.912e−2 | 7.102e−2 | 10.655e−2 | 8.183e−2 | 7.820e−2 | 7.573e−2 |
Proposed (concat | 3.584e−2 | 7.897e−2 | 4.025e−2 | 7.259e−2 | 3.498e−2 | 7.174e−2 | 7.048e−2 | 7.869e−2 | 4.539e−2 | 7.550e−2 |
Proposed (divide | 10.596e−2 | 9.604e−2 | 9.551e−2 | 9.603e−2 | 9.593e−2 | 10.133e−2 | 8.789e−2 | 8.831e−2 | 9.632e−2 | 9.543e−2 |
Proposed (divide | 11.701e−2 | 9.416e−2 | 14.268e−2 | 11.357e−2 | 11.731e−2 | 11.242e−2 | 14.934e−2 | 15.349e−2 | 13.159e−2 | 11.841e−2 |
Proposed (divide | 11.701e−2 | 9.416e−2 | 14.268e−2 | 11.357e−2 | 11.731e−2 | 11.242e−2 | 14.934e−2 | 15.349e−2 | 13.159e−2 | 11.841e−2 |

As described in Section 5.3, our proposed models achieve better results than the PTE method. These results indicate that, for large-scale movement data with spatial dependence, the proposed models capture the purpose characteristics of each area better than the PTE method does. People usually move within small areas and live in defined areas. In light of this constraint, our proposed models work better than the PTE method. For the multi-label classification task, our proposed models (concat

However, the proposed models currently perform only slightly better than unsupervised embedding methods, because they use only two-hop proximity and do not capture the global network structure. As the next step, we should consider the global graph structure of the heterogeneous network and how to apply the labeled network more efficiently. The global graph structure can be captured by GraRep [

We believe that there is considerable room for further research on representation learning for geospatial networks.

Ochi, M., Nakashio, Y., Ruttley, M., Mori, J. and Sakata, I. (2016) Geospatial Area Embedding Based on the Movement Purpose Hypothesis Using Large-Scale Mobility Data from Smart Card. Int. J. Communications, Network and System Sciences, 9, 519-534. http://dx.doi.org/10.4236/ijcns.2016.911041