Trajectory data sets are an indispensable foundation for building reliable Internet of Vehicles (IoV) services and location-based services (LBS), but they can be abused by malicious attackers to infer users’ private information. In this paper, we propose a trajectory protection method based on stop point obfuscation, which can resist various privacy attacks while preserving semantic information to achieve adequate utility. Two strategies for stop point selection are designed: the category-distance priority method and the Markov matrix method. The new method is analyzed and evaluated on a real-world trajectory data set. The experimental results show that our method improves the utility of the protected data set and provides multi-level privacy protection.

With the development of the Internet of Vehicles (IoV) and the popularization of intelligent positioning technology, many new in-vehicle service applications have been designed to assist transportation [

However, if the raw data collected from vehicles or users’ personal terminal devices is published without proper treatment, it may lead to serious privacy risks. A malicious attacker can query the public data set and extract a specific victim’s personal information, including occupation, working location, lifestyle, and even interpersonal network [

While existing approaches have contributed a lot based on statistical features, researchers have usually given less consideration to the semantic traits, i.e., the users’ movement and daily activity patterns carried by the trajectories. Although the aforementioned approaches can generate fake trajectories with high entropy to obfuscate the data set, they inevitably cause semantic loss and significantly reduce utility. Specifically, after desensitization the data set can no longer reflect correct and effective user movement patterns. If such a data set is used in machine learning applications, it will produce a defective model that gives biased results.

From the perspective of pattern recognition and privacy attack, the semantic information of trajectory data is mainly carried by stop points, where users spend a longer time [

Based on this observation, we propose a semantically sensitive privacy protection algorithm for trajectory data. First, we extract the stop points from every trajectory data entry and determine their point-of-interest (POI) descriptions by querying an LBS database. Second, the stop points are categorized into multiple types, and we compute the Markov matrix over these stop point categories, which represents the probability of a transition from one category of stop point to another. This matrix is then used to guide the obfuscation process. For each stop point on a trajectory, the obfuscated stop point is selected with one of two strategies: the category-distance priority method or the Markov matrix method. After the stop points are obfuscated, the algorithm generates the remaining middle points according to the shape of the original trajectory. Finally, the algorithm verifies whether the new trajectory retains high-fidelity statistical traits.

The main contributions of this paper are summarized as follows:

• We propose a semantically sensitive privacy protection algorithm for trajectory data sets. The new algorithm can cloak users’ personal information and resist known privacy attacks, while simultaneously preserving the essential semantic information.

• We design two stop point obfuscation strategies: the category-distance priority method and the Markov matrix method. The advantages and applicable scenarios of each are studied. Data providers can flexibly adopt one of the strategies according to the specific scenario to provide heterogeneous protection levels.

• We evaluate the new algorithm on a real-world data set and compare the results with an existing dummy-based algorithm. The experimental results show that our algorithm better serves the utility requirements while providing multi-level protection for trajectory data.

The rest of this paper is organized as follows. In Section 2, we discuss related work. Preliminaries are presented in Section 3. Section 4 explains the new algorithm in detail, and Section 5 contains its evaluation. We discuss the limitations of our work and a few interesting findings from the research in Section 6, and finally conclude the paper in Section 7.

The privacy risks of public trajectory data sets have drawn much attention from researchers. A re-identification attack is a kind of privacy attack that can identify a specific user in a large data set. In this threat model, the attacker has already obtained enough background knowledge about the victim and can re-identify the victim’s daily trajectory by querying the public data set in a targeted manner. Pellungrini et al. [

A more sophisticated attack called semantic attack was proposed by Sui et al. [

Many methods have been proposed for data protection, including: K-anonymity approach, suppression approach, machine learning approach, and dummy-based approach. K-anonymity [

Suppression is another typical idea for privacy protection, Sweeney [

More recently, Shaham et al. [

Apart from the aforementioned methods, many dummy-based privacy protection approaches were proposed, which require less computing resources and can run smoothly on a large data set. Kato et al. [

In summary, the previous protection methods can be configured to meet the privacy requirements of many application scenarios, but most of them do not consider the semantic information carried by the trajectory data, so the utility of the data sets is seriously degraded by the protection process. In addition, the approaches that involve clustering operations are not suitable for sparse and very large data sets. With these shortcomings in mind, our new method is designed with attention to both semantic information and operating efficiency.

Definition 1: Trajectory Data

A trajectory data entry consists of a sequence of spatiotemporal 3-dimensional tuples and a user identifier, which can be represented as follows:

$$T = \{ (x_1, y_1, t_1), (x_2, y_2, t_2), \cdots, (x_n, y_n, t_n), U_i \} \tag{1}$$

where $(x_i, y_i)$ is a location point, $t_i$ is its timestamp, and $U_i$ is the user identifier. A trajectory data set consists of a number of trajectory data entries; note that many trajectory data entries can be linked to the same user.
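As an illustration of Definition 1, the following minimal sketch shows one possible in-memory representation of a trajectory data entry and of a data set in which several entries belong to the same user. The names `TrajectoryEntry`, `Point` and `group_by_user`, as well as the sample coordinates, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# One spatiotemporal tuple (x, y, t) as in Equation (1).
Point = Tuple[float, float, float]


@dataclass
class TrajectoryEntry:
    """One trajectory data entry: a user identifier plus ordered (x, y, t) tuples."""
    user_id: str
    points: List[Point]


def group_by_user(entries: List[TrajectoryEntry]) -> Dict[str, List[TrajectoryEntry]]:
    """Collect the entries of a data set per user (one user may own many entries)."""
    by_user: Dict[str, List[TrajectoryEntry]] = {}
    for entry in entries:
        by_user.setdefault(entry.user_id, []).append(entry)
    return by_user


# Hypothetical sample data: two entries for user "u1", one for user "u2".
entries = [
    TrajectoryEntry("u1", [(116.30, 39.98, 0.0), (116.31, 39.99, 60.0)]),
    TrajectoryEntry("u1", [(116.32, 40.00, 3600.0)]),
    TrajectoryEntry("u2", [(116.40, 39.90, 0.0)]),
]
grouped = group_by_user(entries)
```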

Definition 2: Location Semantic Data

In our context, the semantic information of a location point refers to its point-of-interest (POI) description, which can be queried from an LBS database or an LBS open API. We assume that each location $(x_i, y_i)$ corresponds to a unique POI description. All POI information can be classified into multiple levels of categories, corresponding to semantic information of different granularities. For example, here is a POI description from the LBS database:

(Jeff’s cuisine, Chinese restaurant, Catering).

The first element of the tuple is the description of the POI, the second element is the second-level category, and the third element is the first-level category. The categories form a hierarchical directory structure in which many second-level categories are located under each first-level category. For example, the Catering category contains the second-level categories Chinese restaurant, foreign restaurant, coffee shop, bar, and so on. The hierarchical category information makes it convenient to locate a target POI quickly, and we leverage this feature to achieve multiple levels of privacy protection.

Definition 3: Markov Matrix

A Markov matrix can be used to represent the stops in a Markov chain [

$$P = \begin{bmatrix} P_{1,1} & P_{1,2} & P_{1,3} \\ P_{2,1} & P_{2,2} & P_{2,3} \\ P_{3,1} & P_{3,2} & P_{3,3} \end{bmatrix} \tag{2}$$

Markov matrix has been widely used in location-based machine learning tasks, like activity classification [

Definition 4: Trajectory Movement Similarity

The movement similarity between two trajectories can be calculated by the sum of rotation angles on each pivot point [

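$$\sigma_{a,b} = \sum_{i=1}^{k} \left| \theta_a^i - \theta_b^i \right| \tag{3}$$

where $k$ is the number of pivot points and $\theta_a^i$ is the angle at the $i$th pivot point on trajectory $a$. The pivot angle $\theta$ can be calculated by:

$$\theta = \arccos\left( \frac{\overrightarrow{L_{i-1}L_i} \cdot \overrightarrow{L_iL_{i+1}}}{\left|\overrightarrow{L_{i-1}L_i}\right| \cdot \left|\overrightarrow{L_iL_{i+1}}\right|} \right) \tag{4}$$

in which $L_i$ is the $i$th location point on the trajectory, so $\overrightarrow{L_{i-1}L_i}$ and $\overrightarrow{L_iL_{i+1}}$ are a pair of consecutive direction vectors.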
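The two formulas of Definition 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: both trajectories have the same number of points, every interior point is treated as a pivot point, coordinates are planar, and a repeated point (zero-length direction vector) contributes a rotation of zero. The function names `pivot_angle` and `movement_similarity` are hypothetical.

```python
import math


def pivot_angle(p_prev, p, p_next):
    """Rotation angle (radians) at point p, following Equation (4)."""
    v1 = (p[0] - p_prev[0], p[1] - p_prev[1])
    v2 = (p_next[0] - p[0], p_next[1] - p[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # assumption: a repeated point adds no rotation
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def movement_similarity(traj_a, traj_b):
    """Sum of absolute pivot-angle differences, following Equation (3)."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of points")
    total = 0.0
    for i in range(1, len(traj_a) - 1):
        theta_a = pivot_angle(traj_a[i - 1], traj_a[i], traj_a[i + 1])
        theta_b = pivot_angle(traj_b[i - 1], traj_b[i], traj_b[i + 1])
        total += abs(theta_a - theta_b)
    return total


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
turning = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
sigma = movement_similarity(straight, turning)  # one pivot, a right-angle turn vs. none
```

Note that a larger value of the sum means the two trajectories are less alike, since it accumulates angle differences.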

Definition 5: Semantic Utility

The semantic utility metric measures the utility of a dummy trajectory. It is defined as:

$$w = \frac{\sum_{i=1}^{k} f(S_i^{dummy}, S_i^{real})}{k} \tag{5}$$

where $S_i^{dummy}$ is the POI description of the $i$th stop point on the dummy trajectory and $S_i^{real}$ is the corresponding one on the real trajectory. $f$ is a matching function that determines the similarity between two POI descriptions. It can be configured for the multiple category levels discussed in Definition 2:

$$f_j(S_1, S_2) = \begin{cases} 1, & S_1.cat_j = S_2.cat_j \\ 0, & S_1.cat_j \neq S_2.cat_j \end{cases} \tag{6}$$

We use the notation $S_i.cat_j$ to represent the $j$th-level category attribute of $S_i$. The following example explains how the matching function works. Assume there are two POI descriptions $S_1$ and $S_2$:

$S_1$ = (Jeff’s cuisine, Chinese restaurant, Catering);

$S_2$ = (Starbucks, Coffee shop, Catering).

If the matching function is configured to match the first-level category, then $f_1(S_1, S_2) = 1$, because the first-level category of both POI descriptions is Catering. If it is configured for the second-level category, then $f_2(S_1, S_2) = 0$, because Chinese restaurant and Coffee shop differ. This design allows the data provider to balance utility against protection level by choosing which matching function to use.
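The matching function and the semantic utility of Definition 5 can be sketched as follows. The `POI` tuple layout mirrors the (description, second-level category, first-level category) example from Definition 2; the names `POI`, `match` and `semantic_utility` are hypothetical, and the sketch assumes exactly two category levels.

```python
from typing import NamedTuple, Sequence


class POI(NamedTuple):
    desc: str   # POI description
    cat2: str   # second-level category
    cat1: str   # first-level category


def match(s1: POI, s2: POI, level: int) -> int:
    """Matching function f_j of Equation (6): 1 if the level-j categories agree."""
    if level == 1:
        return int(s1.cat1 == s2.cat1)
    if level == 2:
        return int(s1.cat2 == s2.cat2)
    raise ValueError("this sketch only models category levels 1 and 2")


def semantic_utility(dummy: Sequence[POI], real: Sequence[POI], level: int) -> float:
    """Semantic utility w of Equation (5): the fraction of matching stop points."""
    if len(dummy) != len(real) or not real:
        raise ValueError("need two non-empty stop point lists of equal length")
    return sum(match(d, r, level) for d, r in zip(dummy, real)) / len(real)


s1 = POI("Jeff's cuisine", "Chinese restaurant", "Catering")
s2 = POI("Starbucks", "Coffee shop", "Catering")

w_level1 = semantic_utility([s2, s1], [s1, s1], level=1)  # both stops match on Catering
w_level2 = semantic_utility([s2, s1], [s1, s1], level=2)  # only the second stop matches
```

Subtracting either value from 1 gives the corresponding loss for the same pair of stop point lists.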

In this section, we describe the new trajectory protection algorithm in detail. The algorithm has three main procedures: preprocessing, dummy trajectory synthesis, and trajectory correction.

The purpose of the preprocessing step is to extract the semantic information from the trajectory data set and build a comprehensive model to guide the synthesis. By default, the preprocessing step involves the trajectory data of every user, so it can extract the common human behavior patterns from daily movement. Alternatively, it can be configured to run on a small group of users to enhance the protection level for certain members.

Stop point detection, also known as stay point detection, is a well-studied research topic [

As discussed in Section 1, stop points can reveal users’ visiting purposes and behavior patterns. This step links each detected stop point to its corresponding POI information by querying a public LBS database or LBS API.

Most LBS databases and LBS APIs organize the POI data in a hierarchical structure for better utility. In this procedure, the data provider can follow the existing category model of the LBS database or adopt a new classification model based on specific application scenarios. The aim is to categorize the POI descriptions and attach them to each detected stop point in the following format: $(x, y, t, desc, cat_1, cat_2)$, where $desc$ is the detailed name of the POI, and $cat_1$ and $cat_2$ are the first-level and second-level category titles, respectively.

First-level category | Second-level category |
---|---|
Catering | Chinese restaurants, Foreign restaurants, Fastfood restaurants, Dessert shops, Coffee shops, Bars, Others |
Estate | Office building, Residential areas, Dormitories, Internal buildings, Others |
Company | Company, Office park, Agriculture and gardening, Factory and mines, Others |
Shopping | Shopping mall, Store, Supermarket, Convenience store, Home building material, Digital home appliance, Agricultural market, Others |

The Markov matrix is introduced to establish a simple but effective semantically sensitive model to guide the obfuscation of stop points. We assume that users choose their next stop point depending only on their current location, so that we can compute the transition probabilities among different POI categories to extract users’ behavior patterns. For ease of demonstration, we use the first-level category to explain the procedure. Assume that all stop points can be divided into $k$ first-level categories $C_1, C_2, \cdots, C_k$; hence the size of the transition matrix $M$ is $k \times k$. A trajectory is decomposed into a set of movements among stop points, so the stop points on the trajectory can be reorganized into source-destination pairs:

$$src: (x_i, y_i, t_i, desc_i, cat_i^1, cat_i^2) \rightarrow dst: (x_{i+1}, y_{i+1}, t_{i+1}, desc_{i+1}, cat_{i+1}^1, cat_{i+1}^2).$$

Each item in the matrix $M$ can be computed by:

$$M_{ij} = \Pr(dst = C_j \mid src = C_i) = \frac{\Pr(dst = C_j, src = C_i)}{\Pr(src = C_i)} \tag{7}$$

The value $M_{ij}$ is the probability that a user currently staying at a stop point of category $C_i$ will move to a stop point of category $C_j$ next. Such a Markov matrix provides data support for selecting appropriate stop points and contributes to a configurable privacy protection level in the later trajectory synthesis procedure.
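A minimal sketch of how Equation (7) could be estimated from data is given below. It assumes that each trajectory has already been reduced to the ordered list of first-level categories of its stop points, and it estimates the probabilities by relative frequency, i.e., the number of observed src-to-dst pairs divided by the number of pairs leaving the source category. The function name `transition_matrix` is hypothetical, and leaving a row of zeros for a category that never appears as a source is an assumption of this sketch.

```python
from typing import List, Sequence


def transition_matrix(category_sequences: Sequence[Sequence[str]],
                      categories: Sequence[str]) -> List[List[float]]:
    """Estimate M[i][j] = Pr(dst = C_j | src = C_i) by counting src->dst pairs."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    counts = [[0] * k for _ in range(k)]
    for seq in category_sequences:
        # Consecutive stop points form the source-destination pairs.
        for src, dst in zip(seq, seq[1:]):
            counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: a category never seen as a source keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


categories = ["Estate", "Company", "Catering"]
sequences = [
    ["Estate", "Company", "Catering", "Estate"],
    ["Estate", "Company", "Estate"],
]
M = transition_matrix(sequences, categories)
```

In this toy input every trip that leaves Estate goes to Company, so the Estate row puts all of its probability mass on Company, while the Company row is split evenly between Estate and Catering.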

This part is the core procedure of our new protection method. The dummy trajectory data generated in this step can either replace the original data or be appended to the data set to achieve K-anonymity. The generation consists of two major steps: stop point obfuscation followed by middle point generation.

Most of the semantic information of a trajectory is carried by the stop points, which are also considered to be the backbone of the trajectory data. Therefore, how to obfuscate the stop points for privacy protection without losing too much utility is a crucial but challenging problem. Our design is given in Algorithm 1.

The workflow first obfuscates the start points with the category-distance priority (CDP) method (line 2), then detects the stop points on the trajectory and obfuscates each one with the pre-configured method (lines 4 - 10). The principles and details of the two obfuscation methods are discussed as follows:

Category-distance Priority (CDP) Method: The majority of trajectory-related machine learning tasks for IoV service, e.g. activity recognition [

Firstly, we determine the initial radius parameter $r$, and the category level $j$ is chosen based on the privacy protection level (line 1). Secondly, the original stop point $p$ is used as the center and $r$ as the radius to search the LBS database for all POIs within the circular area (line 2). Thirdly, the resulting POIs are sorted by category-distance, where a POI from the identical category has a higher rank. If a candidate of the same kind is found, it is output as the result (lines 4 - 12). If no suitable POI is found within the range, the radius $r$ is increased and another search is performed (line 14). Finally, if $r \geq r_{max}$ and a POI of the same kind still cannot be found, we directly choose the POI with the minimum distance from $p$ as the result (line 17).

Algorithm 1. Stop point obfuscation.

Algorithm 2. Category-distance priority method (CDP).
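The CDP search described above can be sketched as follows. This is not the authors' listing but a minimal illustration under several assumptions: the LBS database is replaced by an in-memory list of POIs with planar coordinates in metres, the original POI itself is excluded from the candidates, the radius grows by a fixed step, and the names `POI`, `category_of` and `cdp_obfuscate` as well as the default parameter values are hypothetical.

```python
import math
from typing import List, NamedTuple


class POI(NamedTuple):
    x: float
    y: float
    desc: str
    cat1: str  # first-level category
    cat2: str  # second-level category


def category_of(poi: POI, level: int) -> str:
    return poi.cat1 if level == 1 else poi.cat2


def cdp_obfuscate(stop: POI, pois: List[POI], level: int = 1,
                  r: float = 100.0, r_step: float = 100.0,
                  r_max: float = 500.0) -> POI:
    """Pick the nearest POI of the same category, widening the search radius."""
    if r_step <= 0:
        raise ValueError("r_step must be positive")

    def dist(q: POI) -> float:
        return math.hypot(q.x - stop.x, q.y - stop.y)

    others = [q for q in pois if q != stop]  # assumption: never return the original
    target = category_of(stop, level)
    radius = r
    while True:
        same = [q for q in others
                if dist(q) <= radius and category_of(q, level) == target]
        if same:
            return min(same, key=dist)       # same category, closest first
        if radius >= r_max:
            break
        radius = min(radius + r_step, r_max)
    # Fallback: no same-category POI within r_max, so take the closest POI
    # (or keep the original stop if there is no other POI at all).
    return min(others, key=dist) if others else stop


stop = POI(0.0, 0.0, "Jeff's cuisine", "Catering", "Chinese restaurant")
pois = [
    stop,
    POI(50.0, 0.0, "Corner store", "Shopping", "Convenience store"),
    POI(150.0, 0.0, "Starbucks", "Catering", "Coffee shop"),
    POI(400.0, 0.0, "Noodle house", "Catering", "Chinese restaurant"),
]
first_level = cdp_obfuscate(stop, pois, level=1)
second_level = cdp_obfuscate(stop, pois, level=2)
```

Matching on the second-level category forces the search to travel farther in this toy data, which illustrates how the chosen category level affects the distance between the original and the obfuscated stop point.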

This method is straightforward and effective, and it retains the semantic information to the maximum extent. However, it cannot rule out the case in which an attacker with abundant external background knowledge deobfuscates the result. We therefore designed another method for stronger protection.

Markov Matrix (MM) Method: In the Markov matrix obtained from the preprocessing procedure, each row gives the probabilities of moving from the category of the current position to each category at the next position. Based on this feature, a direct idea is to locate the category with the highest transition probability and search for the next POI in that target category. However, the experimental results show that, because most of a user’s activities start from the residence, almost every row of the generated Markov matrix has its highest probability in the “estate” category.

If we directly chose the highest-probability category as the target category for the next POI, the new trajectory would contain only stop points from residential areas, which is obviously inappropriate. In order to simulate the behavior patterns of genuine users, we use a weighted random function to select the category of the next POI, denoted as weighted_random(C, P), where C is the list of all categories and P is the corresponding probability of each category, i.e., the probabilities from one row of the matrix. Over a sufficient number of runs, the sequence of categories generated by this function follows the input probability distribution. The Markov matrix method works as described in Algorithm 3.

The workflow is generally similar to Algorithm 2, except that the target category is obtained from the weighted random function (line 7). In addition, the procedure is enclosed in a while loop (lines 2 - 16). In our experiments, the procedure always obtained a POI for the category drawn by the weighted_random function and broke the loop with an appropriate radius r.
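The weighted random selection and the surrounding search loop can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 3: `weighted_random` is implemented with the standard library call `random.choices`, the matrix row is taken for the category the user is currently in (an assumption about which row is used), the LBS database is replaced by an in-memory POI list with planar coordinates, and `mm_obfuscate`, `POI` and the default radii are hypothetical. The sketch returns `None` when no POI is found within `r_max`.

```python
import math
import random
from typing import List, NamedTuple, Optional, Sequence, Tuple


class POI(NamedTuple):
    x: float
    y: float
    desc: str
    cat1: str  # first-level category


def weighted_random(categories: Sequence[str], probabilities: Sequence[float],
                    rng=random) -> str:
    """Draw one category according to the given probabilities."""
    return rng.choices(categories, weights=probabilities, k=1)[0]


def mm_obfuscate(stop_xy: Tuple[float, float], current_category: str,
                 pois: List[POI], categories: Sequence[str],
                 matrix: Sequence[Sequence[float]], r: float = 100.0,
                 r_step: float = 100.0, r_max: float = 500.0,
                 rng=random) -> Optional[POI]:
    """Draw a target category from the Markov row, then search for a nearby POI."""
    if r_step <= 0:
        raise ValueError("r_step must be positive")
    row = matrix[list(categories).index(current_category)]
    if sum(row) <= 0:
        return None  # no transition was observed from this category

    def dist(q: POI) -> float:
        return math.hypot(q.x - stop_xy[0], q.y - stop_xy[1])

    radius = r
    while radius <= r_max:
        target = weighted_random(categories, row, rng)  # re-drawn on every pass
        candidates = [q for q in pois if q.cat1 == target and dist(q) <= radius]
        if candidates:
            return min(candidates, key=dist)
        radius += r_step
    return None


categories = ["Estate", "Company", "Catering"]
matrix = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [1.0, 0.0, 0.0]]
pois = [POI(30.0, 40.0, "Tower", "Company"), POI(10.0, 0.0, "Cafe", "Catering")]

picked = mm_obfuscate((0.0, 0.0), "Estate", pois, categories, matrix)
missing = mm_obfuscate((0.0, 0.0), "Catering", pois, categories, matrix)
```

Because the category is drawn at random rather than taken as the row maximum, repeated runs spread the dummy stop points over several categories in proportion to the observed transition probabilities.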

Once the stop point obfuscation is finished, we obtain a simplified but semantically sensitive dummy trajectory, which preserves most of the semantic information of the original trajectory while the private information has been desensitized. Our aim is to obtain a high-fidelity trajectory, so it is also necessary to handle the middle points. In our context, middle points refer to the location points other than stop points on the trajectory. They carry less semantic information but subtly affect comprehensive characteristics such as the overall shape and movement speed. We adopt an improved version of the algorithm from [

The generation algorithm iterates over every location on the trajectory. If the location is a stop point, it has already been secured by the obfuscation in the last step, so we skip it (lines 2 - 4). As it is shown in

Algorithm 3. Markov Matrix Method (MM).

Algorithm 4. Middle points generation.

as radius, and then pick a location to add to the candidate set for every θ degrees. dest(p, d, θ) is a function that determines a destination location from origin p with distance d and rotation angle θ.

In this procedure, θ and random can be configured to adjust the privacy protection level. The larger these values are, the less similar the dummy trajectory is to the original one, and thus the better the protection that can be achieved.
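The dest function and the sampling of one candidate every θ degrees can be sketched as below. This is a minimal sketch that assumes planar coordinates in metres (for latitude and longitude a geodesic destination formula would be needed instead) and that measures the angle counter-clockwise from the x axis. The center and radius of the circle are left as parameters, and the name `candidate_set` is hypothetical.

```python
import math
from typing import List, Tuple

Location = Tuple[float, float]


def dest(p: Location, d: float, theta_deg: float) -> Location:
    """Destination reached from origin p after moving distance d at angle theta."""
    theta = math.radians(theta_deg)
    return (p[0] + d * math.cos(theta), p[1] + d * math.sin(theta))


def candidate_set(center: Location, radius: float,
                  theta_step_deg: float) -> List[Location]:
    """Pick one candidate location on the circle for every theta_step_deg degrees."""
    if theta_step_deg <= 0:
        raise ValueError("theta_step_deg must be positive")
    candidates = []
    angle = 0.0
    while angle < 360.0:
        candidates.append(dest(center, radius, angle))
        angle += theta_step_deg
    return candidates


points = candidate_set((0.0, 0.0), 10.0, 90.0)  # candidates at 0, 90, 180, 270 degrees
```

A larger angular step yields fewer, more widely spaced candidates on the circle.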

The last step is to correct the generated trajectory. First, we check the number of locations and update the timestamps on the dummy trajectory. Then we check whether the movement trend of the dummy trajectory is consistent with that of the original.

The correction procedure first ensures that the numbers of location points of T^{o} and T^{d} are identical (line 1). Then it adds random_shift to the original timestamps and assigns them to the corresponding location points on T^{d} (lines 2 - 4). The random_shift should lie within a range that is appropriate for the configured protection level. Lines 5 and 6 compute the slope, i.e., the movement trend, of the two trajectories. If the difference between the slopes is larger than a threshold, the dummy trajectory is dropped and regenerated from step B-2: middle points generation. The assert keyword in Algorithm 5 represents a condition check; if the condition fails, the procedure returns to the middle points generation step.
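The correction step can be sketched as follows. Several choices here are assumptions of this sketch rather than details confirmed by the paper: the movement trend is approximated by the heading angle from the first to the last point (computed with `atan2` instead of a raw slope, to avoid division by zero), a single random shift is applied to all timestamps of one trajectory so that their order is preserved, and a failed check is signalled by returning `None` so that the caller can regenerate the middle points. The names `heading` and `correct_trajectory` and the default thresholds are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def heading(points) -> float:
    """End-to-end direction (radians) from the first to the last point."""
    (x0, y0), (x1, y1) = points[0][:2], points[-1][:2]
    return math.atan2(y1 - y0, x1 - x0)


def correct_trajectory(original: List[Tuple[float, float, float]],
                       dummy_xy: List[Tuple[float, float]],
                       max_shift: float = 60.0,
                       max_trend_diff: float = math.radians(30.0),
                       rng=random) -> Optional[List[Tuple[float, float, float]]]:
    """Return the dummy trajectory with shifted timestamps, or None if a check fails."""
    # Check 1: both trajectories must contain the same number of location points.
    if len(original) != len(dummy_xy) or len(original) < 2:
        return None
    # Assign shifted original timestamps to the dummy locations.
    shift = rng.uniform(-max_shift, max_shift)
    corrected = [(x, y, t + shift) for (x, y), (_, _, t) in zip(dummy_xy, original)]
    # Check 2: the movement trends must not differ by more than the threshold.
    diff = abs(heading(original) - heading(corrected))
    diff = min(diff, 2.0 * math.pi - diff)  # wrap the angle difference into [0, pi]
    if diff > max_trend_diff:
        return None
    return corrected


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
accepted = correct_trajectory(original, [(0.0, 0.1), (1.0, 0.2), (2.0, 0.1)],
                              rng=random.Random(1))
rejected = correct_trajectory(original, [(2.0, 0.0), (1.0, 0.0), (0.0, 0.0)],
                              rng=random.Random(1))
```

The second call is rejected because the dummy trajectory runs in the opposite direction to the original one.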

The experiments were conducted on a desktop with an AMD Ryzen 7 1700 CPU and 16 GB of RAM, running Ubuntu Linux 18.04 LTS. We implemented the algorithm in Python 3 with Jupyter Notebook.

The Geolife data set (version 1.2.2) was collected by Microsoft Research Asia and published in 2016 [

The new algorithm requires an LBS database or LBS API for querying the POI descriptions. Since the Geolife data set only contains location information (latitude, longitude, timestamp), we chose the Web service API provided by Baidu LBS SDK [

Algorithm 5. Trajectory correction.

From the definition of semantic utility (Equation (5)), we define the utility loss as:

$$u = 1 - \frac{\sum_{i=1}^{k} f(S_i^{dummy}, S_i^{real})}{k} \tag{8}$$

Therefore, the utility loss $u$ takes a value between 0 and 1, indicating the fraction of semantic information that the new trajectory has lost compared with the original data entry from the database. For comparison purposes, we select a typical dummy-based trajectory protection method proposed in [

completely. This figure suggests the data set suffers from significant utility loss after the DSC protection treatment, because it does not consider the POI descriptions.

In conclusion, our new algorithm preserves most semantic information after the privacy protection process. It reduces the utility loss significantly compared with previous methods. Moreover, the category-distance priority method and the Markov matrix method have almost identical effects on semantic preservation.

Another important evaluation criterion for a privacy protection algorithm is similarity; the data provider usually needs to make a trade-off between utility and similarity. A metric for comparing the similarity between two trajectories is defined by Equation (3) in Section 3.2. While there are many other ways to measure similarity, this equation sums the differences of the rotation angles at every pivot point. A larger value of σ therefore means less similarity between the two trajectories, which makes it more difficult for attackers to recover the original trajectory.

In this evaluation, we chose the trajectories that contain fewer than 1000 location points from the data set and generated a dummy trajectory from each of them. Then the similarity was computed between each original trajectory and its dummy trajectory. Other parameters for the algorithms are the same as those discussed in Section 5.2. The results are processed and drawn as box plots in

As it is shown in the graph, the trajectories which contain fewer location points have a wider range of similarity.

that the Markov matrix method can generate more condensed data for the trajectories with larger location points number (above 600).

In general, our new method generates dummy trajectories with a higher σ value, and this holds for trajectories of all lengths. This evaluation demonstrates that our methods provide better protection against attackers in terms of trajectory recovery.

In this experiment, we simulate the privacy attack method proposed in [

User id | Distance (DSC) [ | Distance (CDP) | Distance (MM) |
---|---|---|---|
2 (Home) | 59.9 m | 18.5 m | 174.2 m |
2 (Work) | 26.5 m | 45.4 m | 208.7 m |
5 (Home) | 17.0 m | 22.7 m | 21.0 m |
5 (Work) | 6.3 m | 19.9 m | 36.3 m |
16 (Home) | 188.8 m | 12.8 m | 40.1 m |
16 (Work) | 167.0 m | 44.7 m | 9.3 m |
22 (Home) | 139.6 m | 36.2 m | 0 m |
22 (Work) | 59.0 m | 219 m | 262.4 m |
37 (Home) | 327.1 m | 46.0 m | 178.6 m |
37 (Work) | 37.0 m | 60.0 m | 48.3 m |
43 (Home) | 56.3 m | 54.5 m | 19.3 m |
43 (Work) | 43.2 m | 32.5 m | 42.9 m |
78 (Home) | 54.0 m | 229.9 m | 73.3 m |
78 (Work) | 407.5 m | 275.8 m | 50.8 m |
104 (Home) | 24.1 m | 144.6 m | 103.6 m |
104 (Work) | 35.8 m | 163.1 m | 105.7 m |
111 (Home) | 110.2 m | 21.3 m | 106.8 m |
111 (Work) | 315.6 m | 333.7 m | 336.6 m |
168 (Home) | 75.6 m | 68.5 m | 1263.3 m |
168 (Work) | 606.8 m | 609.4 m | 51.0 m |

replace the original one, thus a fallback happens. It is believed that such distances can thwart the re-identification attack based on the accurate location background knowledge.

In this section, we discuss a few extended topics, including some limitations of our method. The first topic is how to adapt the new method to various application schemes. The new method is flexible, and data providers can design various protection strategies according to their service features. The data provider can run our new method on every data entry in the data set and obfuscate the important stop points. This strategy preserves the major semantic information while obfuscating sensitive private details, so it is appropriate for machine learning tasks such as traffic forecasting and user activity pattern recognition. The data provider can also run the algorithm on each data entry multiple times to generate k − 1 dummy trajectories and add them to the data set. This strategy achieves K-anonymity and increases the data volume of the data set. It is effective in preventing re-identification attacks and is suitable for application schemes that provide an API for users to query the data.

In terms of limitations, the new method relies on the LBS database to extract the semantic information of the trajectories, so the accuracy of the LBS database significantly affects the protection result. If the results from the LBS database are not accurate enough, many fallback situations will occur during the stop point obfuscation procedure; in such situations, the algorithm simply chooses the original stop point to avoid damaging the crucial semantic information. Likewise, the protection performance for trajectories in rural areas or areas with sparse POI information is not as good as that in urban areas, because these areas do not contain enough semantic elements to cloak the genuine movement data. However, because most applications involving trajectory data sets revolve around daily life in cities, we believe that the new method can fulfill the protection requirements of most application schemes.

In this paper, we surveyed previous privacy protection schemes for trajectory data and found that most of the proposed methods do not consider the semantic features of trajectory data. However, many IoV services that involve machine learning rely heavily on the semantic information carried by trajectory data to generate reliable models. To address this issue, we proposed a new trajectory protection method, which is divided into three main procedures: preprocessing, dummy trajectory synthesis, and trajectory correction. Based on the observation that most semantic information is implied by the stop points, we designed two methods for stop point obfuscation: the category-distance priority method and the Markov matrix method. Of the two, the Markov matrix method leverages the transition probability matrix and achieves better protection. We evaluated the new algorithm on the real-world Geolife data set. The experimental results show that the new algorithm has clear advantages in utility loss and similarity metrics compared with the previous representative dummy-based protection method. In addition, we performed a simulated re-identification attack, and the results show that the new methods can protect sensitive private information, e.g., home address and workplace location, with proper obfuscation. We hope that this work can mitigate the conflict between the privacy requirements and the data utility of trajectory data sets, and urge manufacturers to pay more attention to privacy protection in IoV services.

Zhijian Shao was partially supported by Key R&D Program of Guangdong Province (Grant No. 2019B010136003), Natural Science Foundation of Guangdong Province, China (Grant No. 2019B010137005, 2018A030313387). Bingwen Feng was partially supported by National Key R & D Plan of China (Grant No. 2017YFB0802203), National Natural Science Foundation of China (Grant No.61802145), Science and Technology Program of Guangzhou, China (Grant No. 202007040004, 201804010428), the Fundamental Research Funds for the Central Universities, the Opening Project of State Key Laboratory of Information Security, the Opening Project of Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security.

The authors declare no conflicts of interest regarding the publication of this paper.

Shao, Z.J., Feng, B.W., and Li, X.Z. (2021) A Semantically Sensitive Privacy Protection Method for Trajectory Publishing. Journal of Computer and Communications, 9, 35-56. https://doi.org/10.4236/jcc.2021.94003