Zhifeng Bao

Managing and Exploiting Massive Geolocation Data

Big Data, Spatial

The sheer volume, variety, and velocity of data in the modern era have enabled significant advances in many research areas. However, the advances that Big Data brings to the research community do not necessarily translate into benefits for society, that is, for ordinary people living ordinary lives. There is a gap between breakthroughs in the scientific community and their impact on real-world applications. Thus, in 2016 we initiated the Civil Computing project [1], whose mission is aligned with that of the UN-Habitat Data and Analysis Section [2]. We exploit publicly available, comprehensive urban data to bridge the gap between theory and practice. We design a suite of data-driven solutions that deliver real value as insights, not only for individuals to improve their quality of life, but also for non-technical policymakers such as governments to make wiser and more benevolent decisions backed by knowledge extracted from big data.

The team has contributed to the management and analytics of a diverse range of geolocation data spanning many domains and applications, such as improving urban connectivity through public transport and ride-sharing optimization, which utilize trajectory and public transportation data, and comprehensive visual exploration of points of interest (POI) and areas of interest (AOI) data. From these data and problems, we identify the challenges that we face along the way, including handling scalability issues, providing a precise definition of outcome effectiveness and its optimization, and defining human recognition that considers fairness for people from different demographics. We formulate core research questions, and then develop novel query models, robust algorithms, and visual analytic processes to compute and synthesize the results and build usable system prototypes. The team also aids local governments, such as the City of Melbourne and the Department of Health and Human Services Victoria, Australia, in urban planning and governance.

Exploration and Exploitation of POI and AOI Data

Finding a suitable property to purchase or rent is critical in people’s daily lives. This process can be overwhelming and time-consuming since there are many factors to consider, including budgets, facilities such as supermarkets, nearby transportation, and school zones. Together with Timos Sellis, we launched the Geolocation Data Exploration project in the real estate domain, aiming to make this process as easy and efficient as possible. First, we collect and integrate data from a wide range of sources to build a comprehensive Australia-wide real estate dataset consisting of up to 72 attributes under five profiles: regional, educational, transportation, facility, and house profile. Next, we build an interactive visual data exploration system, HomeSeeker [3], for users to explore such geo-related multidimensional POI data on the map. It assists users in understanding local areas and the real estate market, exploring and finding candidate properties based on diverse individual requirements, and visually comparing properties or suburbs on multiple user-specified aspects.

Users’ various exploration needs are transformed into different types of query processing at the back end, such as range queries, distance queries, nearest neighbor queries, and aggregation queries. A major pain point arises when many POIs satisfy the user’s query. Visualizing all of them on the map could cause both perceptual and interactive scalability problems. This motivates us to design an effective yet efficient algorithm to select a small set of representative POIs, where no two selected POIs are so close to each other that users cannot easily distinguish them on the map. Moreover, such an algorithm needs to cater to users’ interaction with the map, so that the representative POIs are updated at interactive speed as the user’s region of interest changes through zooming and panning operations on the map [4].
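The selection requirement described above can be illustrated with a minimal Python sketch. It is not the algorithm of [4]; it is a simple greedy heuristic under assumed inputs, where each POI is a hypothetical (name, x, y, score) tuple, the viewport is an axis-aligned rectangle, and min_dist is the smallest separation that keeps two markers distinguishable.

```python
from math import hypot

def in_viewport(pois, xmin, ymin, xmax, ymax):
    """Range query: keep the POIs inside the user's current region of interest."""
    return [p for p in pois if xmin <= p[1] <= xmax and ymin <= p[2] <= ymax]

def select_representatives(pois, min_dist, k):
    """Greedily pick up to k POIs, highest score first, such that every
    pair of picks is at least min_dist apart (in map units)."""
    chosen = []
    for name, x, y, score in sorted(pois, key=lambda p: -p[3]):
        # Accept a POI only if it is far enough from everything already chosen.
        if all(hypot(x - cx, y - cy) >= min_dist for _, cx, cy, _ in chosen):
            chosen.append((name, x, y, score))
            if len(chosen) == k:
                break
    return chosen

# Hypothetical data: (name, x, y, relevance score).
pois = [("a", 0.0, 0.0, 0.9), ("b", 0.1, 0.0, 0.8),
        ("c", 5.0, 5.0, 0.7), ("d", 9.0, 1.0, 0.6)]
visible = in_viewport(pois, 0, 0, 6, 6)
reps = select_representatives(visible, min_dist=1.0, k=2)
print([p[0] for p in reps])
```

In this sketch every pan or zoom simply reruns both steps on the new viewport; an interactive system would need to reuse earlier results rather than recompute from scratch.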

Furthermore, we develop a cluster-based data structure called ConcaveCubes [5] to support interactive visualizations of large-scale multidimensional AOIs. In particular, we propose a novel concave hull construction method to support boundary-based cluster visualization on the map while preserving real-world geographical semantics. Instead of calculating the clusters on demand every time, ConcaveCubes reuses previous (intermediate) visualization results to meet interaction demands. The last efficiency bottleneck is generating the region boundaries of those AOIs, which are arbitrarily user-defined. Our solution is AOI-shapes [6], a parameter-free footprint method that can recognize multiple regions, outliers, and inner holes of an AOI.

HomeSeeker also enables us to study a range of problems that exploit geo-located data to benefit the decision making of buyers or sellers. We highlight some of them here. The first is multi-level explainable spatial object recommendation, which is joint work with Baidu Research on its map service [7]. By leveraging the intrinsic spatial containment relationships among POIs and AOIs, we provide different recommendations at different stages and granularities of the user’s exploration on the map, together with the hints from which each recommendation is derived. The second is incremental user preference adjustment. Most recommendation systems depend on lists of items rank-ordered according to the user’s preference, but what if an individual user still finds such a personalized ranking unsatisfactory? Ideally, we want the system to adjust its estimate of the user’s preferences after every simple interaction, thereby becoming progressively better at giving the user what she wants. We also want these adjustments to be gradual and explainable, so that the user is not surprised by wild swings in the system’s rank ordering. This inspires us to support a rank-reversal operation on two items x and y, i.e., adjusting the user’s preference such that the personalized ranks of x and y are reversed [8]. This problem is orthogonal to preference learning, and our preference adjustment techniques enable existing offline preference learning models to incrementally and interactively improve their response to (indirectly specified) user preferences. The third is a rank aggregation problem over streaming user queries. HomeSeeker facilitates house finding based on spatial preference. One possible ranking of the houses is based on the (weighted) distance from a preferred location (close to a school, subway station, or supermarket). A house that is ranked higher by many users’ queries has a higher aggregate rank and is likely more popular in the market. The challenge is that user queries are unknown a priori and arrive in a streaming fashion. Our goal is to continuously monitor and report the top-k spatial objects with the highest aggregate rank [9].
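To make the rank aggregation setting concrete, here is a naive Python baseline. It is a sketch under stated assumptions and not the monitoring algorithm of [9]: each query is assumed to be a single preferred location, houses are ranked by unweighted Euclidean distance to it, and the aggregate rank of a house is taken to be the sum of its per-query ranks (lower is better). The class name TopKAggregator and the house identifiers are hypothetical.

```python
import heapq
from math import hypot

class TopKAggregator:
    """Naive baseline: re-rank every house for every incoming query."""

    def __init__(self, houses, k):
        self.houses = houses                      # house id -> (x, y)
        self.k = k
        self.total_rank = {h: 0 for h in houses}  # sum of ranks seen so far

    def process(self, qx, qy):
        """Consume one streaming query and return the current top-k houses."""
        order = sorted(
            self.houses,
            key=lambda h: (hypot(self.houses[h][0] - qx,
                                 self.houses[h][1] - qy), h),
        )
        for rank, h in enumerate(order, start=1):
            self.total_rank[h] += rank
        return self.top_k()

    def top_k(self):
        # Smallest summed rank wins; ties are broken by house id.
        return heapq.nsmallest(
            self.k, self.total_rank, key=lambda h: (self.total_rank[h], h)
        )

agg = TopKAggregator({"h1": (0, 0), "h2": (1, 0), "h3": (5, 0)}, k=2)
for query in [(0, 0), (2, 0), (1, 0)]:
    print(query, agg.process(*query))
```

This baseline costs a full sort per query, which is exactly the kind of work a streaming method would try to avoid for objects that cannot enter the top-k.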

Management and Exploitation of Trajectory Data

Trajectory analytics can benefit many real-world applications such as traffic monitoring, public transit planning, carpooling, and site selection [10]. At their core, most analytics can be decomposed into fundamental queries on trajectory data, which fall into two broad categories: basic queries (i.e., spatial range query, spatio-temporal range query, and ID-temporal query), and advanced queries (i.e., threshold-based trajectory similarity search/join, top-k trajectory similarity search/join, and sub-trajectory similarity search). Depending on the application and domain, raw trajectories and road network-constrained trajectories are the two main target data types. We have developed algorithms and system prototypes to support efficient and scalable query processing on each of them.
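The three basic queries named above can be sketched as brute-force scans in Python. The data layout is an assumption made for illustration: a dictionary from a trajectory ID to a list of (x, y, t) points, with no index. The semantics chosen here (a trajectory matches if at least one of its points falls in the range) is also an assumption, since definitions vary across systems.

```python
def spatial_range(trajs, xmin, ymin, xmax, ymax):
    """IDs of trajectories with at least one point inside the rectangle."""
    return sorted(
        tid for tid, pts in trajs.items()
        if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y, _ in pts)
    )

def spatio_temporal_range(trajs, xmin, ymin, xmax, ymax, t0, t1):
    """As spatial_range, but the matching point must also fall in [t0, t1]."""
    return sorted(
        tid for tid, pts in trajs.items()
        if any(xmin <= x <= xmax and ymin <= y <= ymax and t0 <= t <= t1
               for x, y, t in pts)
    )

def id_temporal(trajs, tid, t0, t1):
    """Points of one given trajectory recorded within [t0, t1]."""
    return [(x, y, t) for x, y, t in trajs.get(tid, []) if t0 <= t <= t1]

# Hypothetical trajectories: id -> [(x, y, timestamp), ...]
trajs = {
    "taxi1": [(0, 0, 1), (1, 1, 2), (2, 2, 3)],
    "taxi2": [(5, 5, 1), (6, 6, 2)],
    "bus7":  [(1, 0, 10), (1, 1, 11)],
}
print(spatial_range(trajs, 0, 0, 1, 1))
print(spatio_temporal_range(trajs, 0, 0, 1, 1, 0, 5))
print(id_temporal(trajs, "taxi1", 2, 3))
```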

On the management of raw trajectory data, we have collaborated with Alibaba Cloud. We find that different users have different query requirements, yet to the best of our knowledge no existing system supports all of them. Moreover, such a system should perform consistently well on trajectories with different characteristics in terms of spatial span, trajectory density, and number of points per trajectory; trajectories of vehicles can differ drastically from those of airplanes and vessels. Storage cost is another major concern for cloud service consumers. To this end, we build a versatile, robust, and economical trajectory data management system [11], whose architecture is outlined in the above Figure. We separate the storage layer from the processing layer, allowing them to be scaled out independently. Here, ‘Versatile’ means that the system supports all typical queries and similarity metrics demanded in industry (three basic queries, five advanced queries, and five widely used similarity metrics). We develop a two-stage processing framework and a uniform pruning framework. We also design tailored parallel algorithms and pushdown strategies to further boost system performance. These designs and optimizations enable the system to work well on datasets with the varying properties mentioned above (‘Robust’), and to achieve similar or even better performance than existing Spark-based systems with far fewer resources (‘Economical’ in the computing layer). With its storage layout design and secondary index in the storage layer, the system achieves 3x smaller storage overhead (‘Economical’ in the storage layer).
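As a rough illustration of how pruning helps similarity search, the following Python sketch uses the generic filter-and-refine pattern. Whether this corresponds to the two stages of the system in [11] is an assumption; the choice of Hausdorff distance, the bounding-box lower bound, and all function names are ours. The filter is safe because the minimum distance between two bounding boxes never exceeds the distance between any pair of points they contain, and hence never exceeds the Hausdorff distance.

```python
from math import hypot

def mbr(traj):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point list."""
    xs = [p[0] for p in traj]
    ys = [p[1] for p in traj]
    return min(xs), min(ys), max(xs), max(ys)

def mbr_min_dist(a, b):
    """Smallest possible distance between a point in box a and one in box b."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)

def hausdorff(a, b):
    """Exact (symmetric) Hausdorff distance between two point lists."""
    def directed(p, q):
        return max(min(hypot(px - qx, py - qy) for qx, qy in q) for px, py in p)
    return max(directed(a, b), directed(b, a))

def similarity_search(query, trajs, threshold):
    """Threshold-based search: cheap filter first, exact distance second.
    Returns the matching ids and how many candidates needed refinement."""
    qbox = mbr(query)
    results, refined = [], 0
    for tid, traj in trajs.items():
        if mbr_min_dist(qbox, mbr(traj)) > threshold:
            continue                      # pruned without computing Hausdorff
        refined += 1
        if hausdorff(query, traj) <= threshold:
            results.append(tid)
    return sorted(results), refined

query = [(0, 0), (1, 0)]
trajs = {
    "near": [(0, 0.5), (1, 0.5)],
    "mid":  [(0, 0), (3, 0)],
    "far":  [(10, 10), (11, 10)],
}
print(similarity_search(query, trajs, threshold=1.0))
```

Here "far" is discarded by the filter alone, while "mid" passes the filter and is rejected only after the exact distance is computed.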

For network-constrained trajectory data, we build a system prototype to support four basic queries and six similarity metrics on network-constrained trajectories, where a unified index is designed and different pruning strategies are proposed [12], and we study the problem of trajectory clustering to find the k representative routes on a road network [13]. We also exploit trajectory data, which captures users’ movement records, to study a range of benefit maximization instances in (commercial or facility) site selection [14,15].

Transport Optimization

Intelligent transportation systems are advanced applications that aim to provide innovative services relating to different modes of transport and traffic management. These enable users to be better informed and make safer, more coordinated, and ‘smarter’ use of transport networks [16]. We pursue the topics of public transportation and ridesharing in the hope of benefiting both individuals and governments. For individuals, our research aims to improve the efficiency and effectiveness of public transport in order to improve community mobility and social equity. For governments, our research aims to aid their endeavors in bettering the lives of citizens through our data-driven approaches.

Public transportation provides crucial support to many cities’ socioeconomic activities. Additionally, as an alternative to automobiles, public transportation mitigates the negative impacts of urban sprawl, reduces road congestion, and reduces environmental emissions. We have explored the topic of public transportation from three perspectives: equitable bus network optimization, optimization through bus frequency modifications, and optimization through route generation. In [17], we present pioneering work on equitable bus network optimization. We formalize a comprehensive notion of bus network efficiency and bus network equity. Our efficiency metric considers three essential components: route directness, waiting time, and the number of transfers. For the equity factor, we propose five metrics that analyze bus network equity based on social and spatial characteristics. Since this work is the first of its kind, we curate the first equitable bus network optimization datasets and make them publicly available. We perform several case studies on the bus network of Singapore to analyze its efficiency and equity. In [18], we explore the public bus frequency optimization problem by studying how to (re)schedule buses so as to maximize the total number of passengers who can receive bus services within a waiting time threshold. Moreover, we study how to optimize the bus network by generating new bus routes, such that the connectivity of the network is improved while the commuting demand is met as much as possible [19].
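The objective of the frequency optimization problem, counting passengers served within a waiting time threshold, can be made concrete for a single stop with a small Python sketch. This only evaluates a given schedule and does not implement the scheduling method of [18]; the single-stop setting, the unlimited bus capacity, and the function name served_passengers are simplifying assumptions.

```python
from bisect import bisect_left

def served_passengers(departures, arrivals, max_wait):
    """Count passengers whose next bus leaves within max_wait time units
    of their arrival at the stop (bus capacity is ignored)."""
    deps = sorted(departures)
    served = 0
    for t in arrivals:
        i = bisect_left(deps, t)          # first departure at or after t
        if i < len(deps) and deps[i] - t <= max_wait:
            served += 1
    return served

# Hypothetical passenger arrival times and two candidate schedules (minutes).
arrivals = [5, 12, 29, 31, 61]
print(served_passengers([10, 30, 60], arrivals, max_wait=10))
print(served_passengers([10, 20, 40], arrivals, max_wait=10))
```

Both schedules use three buses, yet they serve different numbers of passengers, which is why the choice of departure times is an optimization problem in its own right.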

Ridesharing encourages users with similar itineraries and time schedules to share their trips. Such a service can save money, reduce traffic congestion, and increase car seat utilization, benefiting both drivers and riders. In [20], we study how to dynamically find a bilateral matching between a set of drivers and an excess of riders during peak hours under a series of spatio-temporal constraints. Furthermore, we study how to incorporate the willingness of all stakeholders (i.e., the platform, workers, and riders) and find matchable worker-rider pairs that minimize the regret, which includes the rate of unserved requests and the portion of revenue lost from unserved riders [21].
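The two regret components mentioned above can be computed for any given matching, as the following Python sketch shows. The matching step is a simple greedy heuristic chosen for illustration and is not the method of [20] or [21]; the single constraint (a maximum pickup distance), the one-rider-per-driver rule, and the data layout are all assumptions.

```python
from math import hypot

def greedy_match(drivers, riders, max_pickup_dist):
    """Serve riders in descending fare order; each takes the nearest free
    driver if that driver is within max_pickup_dist, else stays unserved."""
    free = dict(drivers)                  # driver id -> (x, y)
    matches = {}                          # rider id -> driver id
    for rid, (x, y, fare) in sorted(riders.items(), key=lambda r: -r[1][2]):
        best = min(
            free,
            key=lambda d: (hypot(free[d][0] - x, free[d][1] - y), d),
            default=None,
        )
        if best is not None and hypot(free[best][0] - x,
                                      free[best][1] - y) <= max_pickup_dist:
            matches[rid] = best
            del free[best]
    return matches

def regret(riders, matches):
    """Return (rate of unserved requests, fraction of fare revenue lost)."""
    unserved = [r for r in riders if r not in matches]
    total = sum(fare for _, _, fare in riders.values())
    lost = sum(riders[r][2] for r in unserved)
    return len(unserved) / len(riders), lost / total

# Hypothetical peak-hour snapshot: more riders than drivers.
drivers = {"d1": (0, 0), "d2": (10, 0)}
riders = {"r1": (1, 0, 20.0), "r2": (0, 1, 10.0), "r3": (9, 0, 5.0)}
matches = greedy_match(drivers, riders, max_pickup_dist=3.0)
print(matches, regret(riders, matches))
```

How the two components are weighted into one regret value is left open here, since the source does not state it.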


Despite the advances in research and the availability of massive geolocation data, the gap between theory and practice limits the real-life applicability of these breakthroughs. We endeavor to bridge this gap to enable research that more directly benefits society, both by supporting the general population and, through our collaboration with governments, by supporting wiser decision-making that will benefit the public. This great undertaking can only be achieved through the efforts of the larger research community; thus, we also promote openness of benchmarks and data collections as an impetus for a more concerted community effort.

Blogger Profile

Zhifeng Bao received his Ph.D. in Computer Science in 2011 from the National University of Singapore. He is a Professor in the School of Computing Technologies at RMIT University and an Honorary Senior Fellow at The University of Melbourne. His research interests include query processing and optimization, data quality, and data exploration. He serves as an Associate Editor of PVLDB and ACM Transactions on Spatial Algorithms and Systems. He was a recipient of the Chris Wallace Award for Outstanding Research and Google Faculty Research Awards. Zhifeng currently co-directs the RMIT Research Centre for Information Discovery and Data Analytics.


[1] Civil Computing. Retrieved from http://civilcomputing.co/

[2] UN Habitat Urban Indicator Database. Retrieved from https://data.unhabitat.org/

[3] M. Li, Z. Bao, T. Sellis, S. Yan, and R. Zhang. HomeSeeker: A Visual Analytics System of Real Estate Data. Journal of Visual Languages & Computing 45: 1-16 (2018).

[4] T. Guo, K. Feng, G. Cong, and Z. Bao. Efficient selection of geospatial data on maps for interactive and visualized exploration. SIGMOD 2018: 567-582.

[5] M. Li, F. Choudhury, Z. Bao, H. Samet, and T. Sellis. ConcaveCubes: Supporting Cluster-based Geographical Visualization in Large Data Scale. Computer Graphics Forum 37(3): 217-228 (2018).

[6] M. Li, Z. Bao, F. Choudhury, H. Samet, M. Duckham, and T. Sellis. AOI-shapes: An Efficient Footprint Algorithm to Support Visualization of User-defined Urban Areas of Interest. ACM TIIS 11, no. 3-4: 1-32 (2021).

[7] H. Luo, J. Zhou, Z. Bao, S. Li, J. Culpepper, H. Ying, H. Liu, and H. Xiong. Spatial Object Recommendation with Hints: When spatial granularity matters. SIGIR 2020: 781-790.

[8] L. Song, J. Gan, Z. Bao, B. Ruan, H. V. Jagadish, and T. Sellis. Incremental Preference Adjustment: A Graph-theoretical Approach. The VLDB Journal 29(6): 1475-1500 (2020).

[9] F. Choudhury, Z. Bao, J. Culpepper, and T. Sellis. Monitoring the Top-m Rank Aggregation of Spatial Objects in Streaming Queries. ICDE 2017: 585-596.

[10] S. Wang, Z. Bao, J. Culpepper, and G. Cong. A Survey on Trajectory Data Management, Analytics, and Learning. ACM Comput. Surv. 54(2): 39:1-39:36 (2021).

[11] H. Lan, J. Xie, Z. Bao, F. Li, W. Tian, F. Wang, S. Wang, and A. Zhang. VRE: A Versatile, Robust, and Economical Trajectory Data System. Proc. VLDB Endow. 15(12): (2022).

[12] S. Wang, Z. Bao, J. Culpepper, Z. Xie, Q. Liu, and X. Qin. Torch: A Search Engine for Trajectory Data. SIGIR 2018: 535-544.

[13] S. Wang, Z. Bao, J. Culpepper, T. Sellis, and X. Qin. Fast Large-Scale Trajectory Clustering. Proc. VLDB Endow. 13(1): 29-42 (2019).

[14] P. Zhang, Z. Bao, Y. Li, G. Li, Y. Zhang, and Z. Peng. Trajectory-driven Influential Billboard Placement. KDD 2018: 2748-2757.

[15] Y. Zhang, Y. Li, Z. Bao, B. Zheng, and H. V. Jagadish. Minimizing the Regret of an Influence Provider. SIGMOD 2021: 2115-2117.

[16] Directive 2010/40/EU of the European Parliament and of the Council on the framework for the deployment of Intelligent Transport Systems in the field of road transport and for interfaces with other modes of transport (2010) (European Union).

[17] D. Tedjopurnomo, Z. Bao, F. Choudhury, H. Luo, and A. K. Qin. Equitable Public Bus Network Optimization for Social Good: A Case Study of Singapore. ACM FAccT 2022: 278-288.

[18] S. Mo, Z. Bao, B. Zheng, and Z. Peng. Towards an Optimal Bus Frequency Scheduling: When the Waiting Time Matters. IEEE Trans. Knowl. Data Eng. doi: 10.1109/TKDE.2020.3036573.

[19] S. Wang, Y. Sun, C. Musco, and Z. Bao. Public Transport Planning: When Transit Network Connectivity Meets Commuting Demand. SIGMOD 2021: 1906-1919.

[20] H. Luo, Z. Bao, F. Choudhury, and J. Culpepper. Dynamic Ridesharing in Peak Travel Periods. IEEE Trans. Knowl. Data Eng. 33(7): 2888-2902 (2021).

[21] T. Wang, H. Luo, Z. Bao, and L. Duan. Dynamic Ridesharing with Minimal Regret: Towards an Enhanced Engagement Among Three Stakeholders. IEEE Trans. Knowl. Data Eng. doi: 10.1109/TKDE.2022.3141368.

Copyright @ 2022, Zhifeng Bao, All rights reserved.