Mohamed Mokbel

Thinking Spatial

Databases, Recommendations, Spatial, Systems
Self-driving cars, ride-sharing services (e.g., Uber and Lyft), and Pokemon Go are just three examples of recent disruptive applications that gained huge market share and publicity. Each self-driving car is expected to generate 2 PB of data per year, with 10 million such cars expected by 2020. Uber has served 2+ billion rides so far, and is adding 60+ million rides monthly. Pokemon Go had more first-week downloads from the Apple App Store than any other app in history. These applications have much in common. In particular, they are all map-based and rely mainly on the whereabouts of their customers and objects. They can all be classified under the wide umbrella of location-based services, where spatial attributes and operations are treated as first-class citizens.

Meanwhile, the database “systems” community has been treating the spatial attributes of an object as just one more attribute, with little special support. System builders mostly build general-purpose systems that are generic enough to handle any kind of attribute. Whenever there is a pressing need for spatial data support, it is treated as an afterthought that can be addressed by adding new data types, extensions, or spatial cartridges to existing systems. We end up producing an over-the-counter system that can be used by any application, regardless of its very specific needs. A main problem with this over-the-counter concept is that we lose the opportunity to talk directly with major customers. For example, think of the three disruptive, very successful, and definitely promising applications that we have been talking about: self-driving cars, ride sharing, and Pokemon Go. They have tons of data that they need to manage. Our first response to them is to use our general-purpose systems to manage their data, as we have done an excellent job of making these systems general enough to work for anyone. With this, we end the discussion too early and distance ourselves from the real needs of these new applications. We need to be closer to such applications, and build specially designed systems for them. Our community knows how to build systems, and we are best suited to tailor our systems to specific needs, so why not do that? Knowing that these applications care a lot about spatial data and spatial attributes, why not design systems that care about these data the most? Why not build special-purpose systems for spatial data?

There are two arguments against designing such special-purpose systems. First, how big is the market segment that needs spatial data support? Second, would it really be different from treating spatial data support as an afterthought? For the first argument, spatial information has become ubiquitous. In addition to the massive numbers I listed earlier, think of the number of users who use maps on their phones for routing, locating stores, or finding family whereabouts. The market is really huge, and it is our role to build specialized systems for that market segment; otherwise, we will lose that segment to some other community. For the second argument, I would strongly argue that things would be really different if we start thinking spatial when building our systems. Here are a few examples of systems that would look really different if we built them while thinking spatial:

Database systems: A commercial DBMS can easily support a nearest-neighbor query through a simple SQL query that selects the object ID from the table of objects, orders by distance, and limits the answer to one. Yet, this is extremely inefficient, as it requires calculating the distance between the user location and each object in the table and then sorting the results. Commercial DBMSs may not care much about this, since a nearest-neighbor query is not considered common and hence its performance does not hurt much. Thinking spatial, we would consider having a specially designed nearest-neighbor operator that can be added to a query plan with other query operators. This also means modifying the query optimizer to consider query plans with the nearest-neighbor operator, as well as having new spatial index structures to support that important query. This is just one example of an important spatial operation. There is a real need to build DBMSs with such important spatial operations in mind.
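To make the contrast concrete, here is a minimal Python sketch, not taken from any particular DBMS. The function nn_scan_and_sort mimics what the ORDER BY distance, LIMIT 1 query does without spatial support, while nn_kdtree answers the same query through a k-d tree, used here only as one familiar example of a spatial index. All names (Obj, KDNode, build_kdtree, and so on) are hypothetical and chosen for illustration.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Obj = Tuple[int, Point]  # (object id, (x, y)); a hypothetical table row


def nn_scan_and_sort(objects: List[Obj], q: Point) -> int:
    """Mimics ORDER BY distance LIMIT 1: compute every distance, sort, take one."""
    return sorted(objects, key=lambda o: math.dist(o[1], q))[0][0]


class KDNode:
    def __init__(self, obj: Obj, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]) -> None:
        self.obj, self.axis, self.left, self.right = obj, axis, left, right


def build_kdtree(objects: List[Obj], depth: int = 0) -> Optional[KDNode]:
    """Builds a 2-d tree by splitting on the median, alternating x and y."""
    if not objects:
        return None
    axis = depth % 2
    ordered = sorted(objects, key=lambda o: o[1][axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def nn_kdtree(node: Optional[KDNode], q: Point,
              best: Optional[Tuple[float, int]] = None) -> Optional[Tuple[float, int]]:
    """Returns (distance, object id) of the nearest object, pruning subtrees."""
    if node is None:
        return best
    d = math.dist(node.obj[1], q)
    if best is None or d < best[0]:
        best = (d, node.obj[0])
    diff = q[node.axis] - node.obj[1][node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nn_kdtree(near, q, best)
    # Visit the far side only if the splitting line is closer than the best hit.
    if abs(diff) < best[0]:
        best = nn_kdtree(far, q, best)
    return best
```

The scan-and-sort version touches every object on every query, whereas the index-based version skips whole subtrees that cannot contain a closer object. A dedicated nearest-neighbor operator inside a DBMS would rely on the same kind of pruning, with the index built once and reused across queries.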

HDFS-based systems: Many of the new big data systems are based on the Hadoop Distributed File System (HDFS), where the data is organized in partitions of a default size (e.g., 64MB or 128MB). HDFS treats partitions as a heap structure, where any query needs to scan all partitions, applying any filters on the fly. Spatial queries are handled like any other query: a spatial filter is treated in the same way as a non-spatial filter. This is not friendly to spatial applications, which would prefer first-class treatment for spatial operations. Thinking spatial, we would ensure that data in the same partition are spatially close to each other. Then, each partition would be annotated with a Minimum Bounding Rectangle (MBR) that encloses all its spatial contents. Given a query with a spatial filter, a simple look at the partition annotation tells whether there is a need to scan the contents of that partition. This allows significant pruning for spatial operations, making such big data systems friendlier to new emerging spatial applications.
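The MBR-based pruning described above can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions, not HDFS code: partitions are formed with a uniform grid (one of several possible spatial partitioning schemes), and the names Partition, partition_spatially, and range_query are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class Partition:
    points: List[Point]
    mbr: Rect  # annotation: smallest rectangle enclosing all points


def mbr_of(points: List[Point]) -> Rect:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def partition_spatially(points: List[Point], extent: Rect,
                        cells_per_side: int) -> List[Partition]:
    """Groups nearby points into the same partition using a uniform grid."""
    min_x, min_y, max_x, max_y = extent
    cell_w = (max_x - min_x) / cells_per_side
    cell_h = (max_y - min_y) / cells_per_side
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for x, y in points:
        cx = min(int((x - min_x) / cell_w), cells_per_side - 1)
        cy = min(int((y - min_y) / cell_h), cells_per_side - 1)
        cells[(cx, cy)].append((x, y))
    return [Partition(pts, mbr_of(pts)) for pts in cells.values()]


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(partitions: List[Partition],
                query: Rect) -> Tuple[List[Point], int]:
    """Returns matching points and the number of partitions actually scanned."""
    results: List[Point] = []
    scanned = 0
    for part in partitions:
        if not intersects(part.mbr, query):
            continue  # pruned by looking at the MBR annotation alone
        scanned += 1
        results.extend(p for p in part.points
                       if query[0] <= p[0] <= query[2]
                       and query[1] <= p[1] <= query[3])
    return results, scanned
```

With a heap organization every partition would be scanned for every query. Here the second return value of range_query shows how many partitions survive the MBR check, which for a small query window is typically a small fraction of the total.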

Recommender systems: Recommender systems make use of community opinions to help users identify useful items from a considerably large search space (e.g., Amazon inventory, Netflix movies). The technique used by many of these systems is collaborative filtering, which analyzes past community opinions to find correlations among similar users and items. Community opinions are expressed through explicit ratings represented by the triple (user, rating, item), which records a user providing a numeric rating for an item. Unfortunately, recommender systems are not friendly to spatial operations. For example, one may want a recommendation for a restaurant in a certain area, or a tourist may want recommendations of items preferred by locals, e.g., “When in Rome, do as the Romans do”. Getting such spatial recommendations from existing recommender systems would amount to a spatial filter on top of the existing system, which is not accurate, as it loses the essence of the collaborative filtering method. Thinking spatial, we would change the rating triple to a five-tuple (user, user_location, rating, item, item_location) that includes the user and item locations, if known. Then, the collaborative filtering functionality would need to be built with the knowledge of these spatial locations. In other words, the spatial attributes would be pushed into the core of the collaborative filtering functionality rather than being an on-top filter.
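As a rough illustration of what pushing location into the core might look like, the Python sketch below stores ratings as the five-tuple and lets the rater's location enter the neighbor weight of a simple user-based collaborative filter. The weighting scheme (cosine similarity multiplied by an exponential distance decay) and the names Rating, predict, here, and scale are assumptions made for this example, not a method prescribed by this post or by any specific system.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rating:
    """The five-tuple (user, user_location, rating, item, item_location)."""
    user: str
    user_location: Point
    rating: float
    item: str
    item_location: Point


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two users' item-to-rating profiles."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings: List[Rating], user: str, item: str,
            here: Point, scale: float = 1.0) -> Optional[float]:
    """Predicts user's rating of item, favoring raters located near `here`.

    Assumed weighting: taste similarity times exp(-distance / scale), so the
    location is part of the neighbor weight instead of a filter on the output.
    """
    by_user: Dict[str, Dict[str, float]] = defaultdict(dict)
    for r in ratings:
        by_user[r.user][r.item] = r.rating
    num = den = 0.0
    for r in ratings:
        if r.item != item or r.user == user:
            continue
        sim = cosine(by_user[user], by_user[r.user])
        decay = math.exp(-math.dist(r.user_location, here) / scale)
        weight = sim * decay
        num += weight * r.rating
        den += weight
    return num / den if den > 0 else None
```

Calling predict with here set to the tourist's current position makes the opinions of nearby raters dominate the prediction, which is one simple reading of "do as the Romans do". A very large scale makes the decay negligible, so the function falls back to ordinary similarity-weighted collaborative filtering.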

Social networks: Social network systems show their users a news feed (i.e., a set of posts) that is ordered either temporally or by importance. The importance is mostly based on how popular each post is. Thinking spatial, I would like to see my news feed more related to my spatial location. For example, let’s say that one of my friends has visited Istanbul and posted something about it. I saw the post and ignored it, as it did not really matter to me at the time, being in Minneapolis. A few months down the road, I am in Istanbul, looking at my news feed. The most important post I would like to see now is the one that was related to Istanbul and posted a few months ago. Yet, as the social network is not designed for spatial awareness, it cannot recognize that this post is much more important to me now than other posts. Thinking spatial, a social network would relate each post to a location, and give it a spatial domain of interest. Such spatial information should play a major role in deciding what each user sees. Knowing my whereabouts, the social network should be able to show me what is spatially related to me at the moment.

Crowd sourcing: The idea of crowd sourcing is to assign a certain task to a group of people who will solve it. Amazon Mechanical Turk is a prime example, where given a task, a set of workers step in to solve it and get paid for it. Finding the right set of workers for a certain task does not take much account of the locations of those workers. Thinking spatial, many of the tasks are spatially oriented, where the location of the worker plays an important role in adequately performing the task. For example, rating a restaurant is best done locally, and exactly geolocating an object whose whereabouts we have only a vague idea of is best done by people living around that object. Also, some areas in the world have more expertise in some jobs than others, e.g., translation tasks or sports-related tasks. In general, putting the locations of workers up front in the equation would change the way we assign workers to crowdsourcing tasks, and achieve better quality.

The list goes on and on to include systems designed for data streaming, data privacy, data cleaning, and data integration, among others. And the question comes again: is it really worth building such systems while Thinking Spatial? I would say definitely yes, it is worth it. Spatial information is really special, and it should not be considered as just a few more attributes. If we pass on this opportunity, other communities will step in and pay more attention than us to the needs of spatial data applications.

Blogger Profile

Mohamed F. Mokbel (Ph.D., Purdue University, MS, B.Sc., Alexandria University) is an Associate Professor in the Department of Computer Science and Engineering, University of Minnesota. His research interests include the interaction of GIS and location-based services with database systems and cloud computing. His research work has been recognized by the VLDB 10-Years Best Paper Award, five Best Paper Awards, and the NSF CAREER award. Mohamed is/was the program co-chair for ACM SIGMOD 2018, ACM SIGSPATIAL GIS from 2008 to 2010, and the IEEE MDM Conference in 2011 and 2014, and the General Chair for SSTD 2011. He is an Associate Editor for ACM TODS, ACM TSAS, the VLDB Journal, and GeoInformatica. Mohamed is the elected Chair of ACM SIGSPATIAL for 2014-2017.

Copyright © 2016, Mohamed Mokbel. All rights reserved.