1. Question
Location-based services (LBS) data contain the spatial locations of many mobile phone users, but they do not independently describe travel behavior (Du and Aultman-Hall 2007). Classifying LBS data from raw location points into semantic activities is potentially desirable for many reasons, including removing error from travel diaries, improving the quality of travel models, and providing insights into traveler decisions (Bohte and Maat 2009; Usyukov 2017).
The most basic attempts have used heuristic algorithms. Common heuristics include the implied speed between points or the direction between a series of points (Deng and Ji 2010). But it is difficult to develop conclusive rules: is a person moving slowly because they are waiting for a traffic signal or because they are doing an activity? Decision trees can expand the number and types of rules (Lee and Lee 2014), potentially increasing the classification accuracy. More advanced methods rely on artificial intelligence (AI), but these can be difficult to deploy at smaller scales or to train from unlabeled data (Xiao, Juan, and Zhang 2016); it is not clear from the literature how many labeled user-days would be necessary to train the AI. An intermediate approach of density-based clustering algorithms (Duan et al. 2007; Gong, Yamamoto, and Morikawa 2018; Luo et al. 2017) shows promise for particular applications where a neural network may be difficult to train, but these methods require estimating or asserting parameters that may not be immediately intuitive.
In this research, we develop an error function describing the accuracy of activity locations classified by a Density-Based Spatial Clustering with Additional Noise and Time Entropy (DBSCAN-TE) algorithm. We calibrate the algorithm's parameters by optimizing this function against manually labeled activity locations.
2. Methods
2.1. DBSCAN-TE Algorithm
The DBSCAN-TE algorithm classifies activities from raw location data in two steps. First, a DBSCAN clustering algorithm (Rehman et al. 2014) finds high-density areas in the spatial data and excludes noise. This algorithm requires two parameters: $\varepsilon$, the search radius of an activity cluster, and minpts, the minimum number of points that constitute a cluster.
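For concreteness, this first step might be sketched with the dbscan package in R; the coordinate columns and parameter values below are illustrative placeholders, not the calibrated values reported later.

```r
# A minimal sketch of the spatial clustering step with the 'dbscan'
# package, assuming `points` holds projected coordinates in meters; the
# column names and parameter values here are illustrative placeholders.
library(dbscan)

eps    <- 50   # search radius of an activity cluster (meters)
minpts <- 200  # minimum number of points to constitute a cluster

clusters <- dbscan(as.matrix(points[, c("x", "y")]),
                   eps = eps, minPts = minpts)

# cluster 0 is noise; positive labels are candidate activity clusters
table(clusters$cluster)
```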
The second step is a time and entropy step. DBSCAN alone cannot distinguish between clusters that are separated by time and not space. For example, individuals may begin their day at home and return later; DBSCAN classifies both sets of points as one activity, because they are at the same location. A time parameter $\tau$ is used to separate LBS points in the same cluster that are at least $\tau$ units apart. Finally, DBSCAN-TE considers the "entropy" of points within a candidate cluster, where entropy is a function of the directions between consecutive LBS data points. This eliminates clusters where the device is slowly moving in one direction, as in a traffic queue. The angle between consecutive points is mapped onto sectors of a circle, and the entropy is then calculated as

$$S = -\sum_{d=1}^{D} \frac{n_d}{N} \ln\left(\frac{n_d}{N}\right)$$

where $n_d$ is the number of directions falling in sector $d$, $N$ is the total number of rays in the cluster, and $D$ is the total number of sectors (often 8) (Gong, Yamamoto, and Morikawa 2018). If $S < S_{min}$ (a set threshold), then that cluster is disqualified as an activity point.
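A minimal sketch of this entropy calculation in R, assuming the cluster's coordinates are available as plain vectors:

```r
# A sketch of the direction-entropy check, assuming `x` and `y` hold the
# coordinates of consecutive LBS points in one candidate cluster, and
# that the circle is divided into D = 8 equal sectors.
direction_entropy <- function(x, y, D = 8) {
  # bearing of each ray between consecutive points, mapped to [0, 2*pi)
  theta <- atan2(diff(y), diff(x)) %% (2 * pi)
  # assign each ray to one of D equal sectors of the circle
  sector <- findInterval(theta, seq(0, 2 * pi, length.out = D + 1),
                         rightmost.closed = TRUE)
  n_d <- tabulate(sector, nbins = D)
  N <- length(theta)
  p <- n_d[n_d > 0] / N   # drop empty sectors, since 0 * log(0) -> 0
  -sum(p * log(p))
}

# A candidate cluster with entropy below the S_min threshold would be
# disqualified as an activity point.
```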
2.2. Error Measurement and Calibration
We desire to identify the parameter vector $\theta = (\varepsilon, minpts, \tau, S_{min})$ that minimizes the "error" between activity points generated by DBSCAN-TE and labeled activity points for a user-day. Let $P_i$ be the set of location points for user-day $i$, $L_i$ the set of points within a distance of a labeled activity point on the user-day, and $D_i$ the set of points within the same distance of an activity point identified by DBSCAN-TE. The total error is

$$E = \sum_{i=1}^{N}\left[1 - \frac{|L_i \cap D_i| + |P_i \setminus (L_i \cup D_i)|}{|P_i|}\right]$$

where $L_i \cap D_i$ are the points in both the labeled data and the predicted data, and $P_i \setminus (L_i \cup D_i)$ are the points assigned to an activity cluster in neither. Thus the error for a user-day is effectively the share of points that are classified differently by the labels and by the algorithm, and the total error is the sum of this value over all user-days.

To identify parameters minimizing the error function, we use a simulated annealing algorithm in R (King, Nguyen, and Ionides 2016). Simulated annealing is useful in finding global optima on non-convex or discontinuous objective functions (Bertsimas and Tsitsiklis 1993). Simulated annealing also permits box constraints on the parameters and parameter scaling, which matters here because the four parameters are defined on different scales with different units. The constraint values are shown in Table 1 and were set based on an initial search of the literature (Duan et al. 2007; Gong, Yamamoto, and Morikawa 2018; Luo et al. 2017) supported by intuition. For example, we set a minimum $\tau$ of 5 minutes (300 seconds), believing a gap this short to be an unrealistic duration for an intervening activity. The scale parameters were set so that the range covered by the parameters was similar across all four dimensions.
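This calibration might be sketched in R as follows; the run_dbscan_te() wrapper, bounds, starting values, and scales are hypothetical stand-ins, and base optim()'s "SANN" method substitutes for the box-constrained annealer cited above.

```r
# A sketch of the per-user-day error, assuming `labeled` and `predicted`
# are logical vectors over the points P_i of one user-day: TRUE when a
# point lies within the buffer distance of a labeled (or DBSCAN-TE)
# activity point.
user_day_error <- function(labeled, predicted) {
  agree_active <- sum(labeled & predicted)    # points in both L_i and D_i
  agree_noise  <- sum(!labeled & !predicted)  # points in neither set
  1 - (agree_active + agree_noise) / length(labeled)
}

# Total error over all user-days; run_dbscan_te() is a hypothetical
# wrapper that applies DBSCAN-TE with parameter vector `theta` and flags
# each point as belonging to a predicted activity cluster or not.
total_error <- function(theta, days) {
  sum(vapply(days, function(d) {
    user_day_error(d$labeled, run_dbscan_te(d$points, theta))
  }, numeric(1)))
}

# Base optim()'s "SANN" method lacks box constraints, so this sketch
# approximates the box-constrained annealer by returning Inf outside the
# Table 1 bounds; `start`, `p_min`, `p_max`, and `scales` are assumed.
penalized <- function(theta, days, p_min, p_max) {
  if (any(theta < p_min | theta > p_max)) return(Inf)
  total_error(theta, days)
}
fit <- optim(start, penalized, days = days, p_min = p_min, p_max = p_max,
             method = "SANN",
             control = list(maxit = 5000, parscale = scales))
```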
2.3. Data
Data for this study come from a comprehensive longitudinal dataset of 78 volunteers using a mobile application that collects detailed location-based services data. We drew a random sample of 25 high-quality user-days, defined as the 24 hours between 3:00 AM and 2:59 AM the next day with a time density of at least 30 location points per minute. We mapped the points in GIS software and manually added points at the visually apparent activity locations.
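This screening might be sketched as follows, with all table and column names assumed for illustration:

```r
# A sketch of the user-day screening, assuming the raw LBS table `lbs`
# has `user_id` and POSIXct `timestamp` columns; all names here are
# illustrative.
library(dplyr)
library(lubridate)

user_days <- lbs %>%
  # shift by 3 hours so each "day" runs from 3:00 AM to 2:59 AM
  mutate(day = as.Date(timestamp - hours(3))) %>%
  group_by(user_id, day) %>%
  summarise(points_per_minute = n() / (24 * 60), .groups = "drop") %>%
  filter(points_per_minute >= 30)

# draw the 25 user-days for manual labeling
set.seed(42)
labeled_sample <- slice_sample(user_days, n = 25)
```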
3. Findings
We ran the simulated annealing algorithm for 5,000 iterations, beginning with four randomly sampled sets of starting values. Table 2 shows the results of each run alongside a mean value. For most parameters there is some level of agreement on the scale of the parameters, with three of the four runs matching in the first or second significant digit. The anomalous value differs between runs, however, with Runs 3 and 4 each returning a low value for a different parameter. The mean error of 1.29 implies that the algorithm predicts 94.9 percent of points correctly across the 25 user-days.
Figure 1 shows the mean value of the error function across two of the four parameters. If a cell is blank, the simulated annealing algorithm either did not search there or the error exceeded 1.5. This figure shows only two of the four dimensions in the objective function, so there is variance at each coordinate. Nevertheless, the plot shows consistently lower error at low values of both plotted parameters. Figure 2 shows a density of each parameter across all four runs, restricted to low-error iterations; the results of 55 iterations are included. The distribution is weighted by the inverse of the error at each optimization iteration. Other weight constructions did not result in substantially different interpretations of this figure. The modes of this density plot are candidates for the preferred values.
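The weighting can be sketched with the weights argument of R's density() function; the data frame and column names are assumed:

```r
# A sketch of the inverse-error weighting behind Figure 2, assuming
# `iters` is a data frame of low-error iterations holding the searched
# parameter values and their error; all names are illustrative.
w <- 1 / iters$error
w <- w / sum(w)   # density() expects weights that sum to 1

# weighted density of one parameter (here the DBSCAN search radius)
dens <- density(iters$eps, weights = w)
plot(dens, main = "Inverse-error-weighted density")
```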
Researchers with unlabeled location-based services data may apply the DBSCAN-TE algorithm directly using the parameters identified in this research, but should exercise caution. The most sensitive parameter in other data sets is likely to be minpts, the minimum number of points that constitute a cluster: with less temporally dense data, fewer points will accumulate, and the 200-point threshold may not be viable. Understanding the relationship between data density and the values of these parameters is important future research.

A different error function could be developed that includes not only the location but also the duration of activities and their sequence in a time-space framework. This would improve the accuracy of the calibration but would also increase the difficulty of the labeling task. Similarly, data with a different temporal or spatial resolution may lead to different optimal parameters. Further research should also explore how many labeled user-days are sufficient to identify stable parameters in the DBSCAN-TE algorithm, versus how many would be needed to train an AI to perform this task accurately.
Acknowledgments
The data used in this research were collected with help from an Interdisciplinary Research Grant at Brigham Young University and administered under IRB protocol F2020-242. The investigators on the overarching grant include Terisa Gabrielsen, Jared Nielsen, and Mikle South. Myrranda Salmon supported the data cleaning efforts.
Author Contribution Statement
Gregory S. Macfarlane: Conceptualization, Methodology, Software, Formal Analysis, Writing - original draft, Supervision.
Gillian Martin: Methodology, Software, Investigation, Writing - original draft.
Emily K. Youngs: Software, Formal Analysis, Data curation.
Jared A. Nielsen: Resources, Writing - review & editing, Supervision, Project administration, Funding acquisition.