Is Query Optimization a “Solved” Problem?

Is Query Optimization a “solved” problem? If not, are we attacking the “right” problems? How should we identify the “right” problems to solve?

I asked these same questions almost exactly 25 years ago, in an extended abstract for a Workshop on Database Query Optimization that was organized by the then-Professor Goetz Graefe at the Oregon Graduate Center [Grae 89a]. Remarkably and quite regrettably, most of the issues and critical unsolved problems I identified in that brief rant remain true today. Researchers continue to attack the wrong problems, IMHO: they attack the ones that they can, i.e., that they have ideas for, rather than the ones that they should, i.e., that are critical to successfully modeling the true cost of plans and choosing a good one and, perhaps more importantly, to avoiding a disastrous one! At the risk of repeating myself, I’d like to re-visit these issues, because I’m disappointed that few in the research community have taken up my earlier challenge.


The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities. Everything in cost estimation depends upon how many rows will be processed, so the entire cost model is predicated upon the cardinality model. In my experience, the cost model may introduce errors of at most 30% for a given cardinality, but the cardinality model can quite easily introduce errors of many orders of magnitude! I’ll give a real-world example in a moment. With such errors, the wonder isn’t “Why did the optimizer pick a bad plan?” Rather, the wonder is “Why would the optimizer ever pick a decent plan?”

“Well,” you say, “we’ve seen lots of improvements in histograms, wavelets, and other statistics since 1989. Surely we do better now.” There’s been no shortage of such papers, it’s true, but the wealth of such papers precisely illustrates my point. Developing new histograms that improve selectivity estimation for individual local predicates of the form “Age BETWEEN 47 AND 63” by a few percent doesn’t really matter, when other, much larger errors that are introduced elsewhere in cardinality estimation dwarf those minor improvements. It’s simple engineering, folks. If I have to review one more such paper on improved histograms for local predicates, I’ll scream (and reject it)! It just doesn’t matter! What we have now is good enough.

What still introduces the most error in cardinality estimation is (a) host variables and parameter markers, (b) the selectivity of join predicates, and, even more significantly, (c) how we combine selectivities to estimate the cardinality. Amazingly, these three topics also have enjoyed the least research attention, or at least the fewest papers attempting to solve them, unless I’ve missed some major contributions lately. I’ll visit each of these topics in turn, describing the fundamental causes of errors, and why those errors can easily reach disastrous proportions, illustrated by war stories from real customer situations.

Host variables and parameter markers
Host variables, parameter markers, and special registers occur in SQL queries because applications on top of the DBMS, not humans, invoke most queries, unlike in academic papers. Such applications typically get the constants used in predicates from a “fill in the blank” field on a web page, for example. The SQL predicate then looks like “Age BETWEEN :hv1 AND :hv2”. At compile time, the optimizer has no clue what :hv1 and :hv2 are, so it cannot look them up in those wonderful histograms that we all have polished. Instead, it must make a wild guess at the average or likely values, a guess that can be off significantly from one execution to the next, or simply because of skew in the data. A war story illustrates this point.

One of our major ISVs retrofitted a table with a field that identified which subsystem each row came from. It had 6 distinct values, but 99.99% of the rows had the value of the founding subsystem, i.e., the one from the days when there was only one subsystem. A predicate on this subsystem column was added to every query, with the value being passed as a host variable. Not knowing that value a priori, DB2’s optimizer used the average value of 1/|distinct values| = 0.167, though that predicate’s true selectivity was usually 0.9999 (not selective at all) and once in a blue moon was 0.0001 (extremely selective).

There has been some work on this so-called Parametric Query Optimization (PQO), though it’s sometimes attacking the problem of other parameters unknown at compilation time (e.g. the number of buffer pages available) or limited to discrete values [Ioan 97]. One of my favorites is a fascinating empirical study by Reddy and Haritsa [Redd 05] of plan spaces for several commercial query optimizers as the selectivities of multiple local predicates are varied. It demonstrated (quite colorfully!) that regions in which a particular plan is optimal may not be convex and may even be disconnected! Graefe suggested keeping a different plan for each possible value of each host variable [Grae 89b], but with multiple host variables and a large number of possible values, Graefe’s scheme quickly becomes impractical: one has to optimize and store a large cross-product of possible plans, and choose among them at run time, unless they are grouped into regions sharing the same plan [Stoy 08].

Version 5 of DB2 for OS/390 (shipped June 1997) developed a practical approach to force re-optimization for host variables, parameter markers, and special registers by adding new SQL bind options REOPT(ALWAYS) and REOPT(ONCE). The latter re-optimizes the first time that the statement is executed with actual values for the parameters, and assumes that these values will be “typical”, whereas the former forces re-optimization each time the statement is run. Later, a REOPT(AUTO) option was added to autonomically determine if re-optimization is needed, based upon the change in the estimated filter factors from the last re-optimization’s plan.
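
To make the difference between these options concrete, here is a minimal sketch of the re-optimization decision each policy implies. This is illustrative Python pseudologic, not DB2 internals; the class, function, and threshold names are mine.

```python
# Illustrative sketch of the three re-optimization policies described above.
# The names and the threshold below are hypothetical, not DB2 APIs.

class ReoptPolicy:
    ALWAYS = "ALWAYS"   # re-optimize on every execution with the actual values
    ONCE = "ONCE"       # re-optimize on the first execution, then reuse that plan
    AUTO = "AUTO"       # re-optimize only when estimated filter factors shift enough

def should_reoptimize(policy, execution_count, prev_filter_factor,
                      new_filter_factor, threshold=10.0):
    """Return True if the statement should be re-optimized before this run."""
    if policy == ReoptPolicy.ALWAYS:
        return True
    if policy == ReoptPolicy.ONCE:
        return execution_count == 0
    if policy == ReoptPolicy.AUTO:
        if execution_count == 0 or prev_filter_factor == 0:
            return True
        # Re-optimize when the selectivity estimate moves by more than an
        # order-of-magnitude-style factor (the factor of 10 is illustrative).
        ratio = max(new_filter_factor, prev_filter_factor) / \
                max(min(new_filter_factor, prev_filter_factor), 1e-12)
        return ratio > threshold
    return False
```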

Selectivity of join predicates
The paucity of innovation in calculating join predicate selectivities is truly astounding. Most extant systems still use the techniques pioneered by System R for computing the selectivity of a join as the inverse of the maximum of the two join-column cardinalities [Seli 79], which essentially assumes that the domain of one column is a subset of the other. While this assumption is valid for key domains having referential integrity constraints, in general the overlap of the two sets may vary greatly depending upon the semantics of the joined domains. Furthermore, the common practice of pre-populating a dimension table such as “Date” with 100 years of dates into the future can incorrectly bias this calculation when the fact table initially has only a few months of data. Statistics on join indexes [Vald 87] would give much more precise selectivities for join predicates, if we were willing to suffer the added costs of maintenance and lock contention of updates to these join indexes. However, join indexes would not solve the intersection problem typical of star schemas.
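
For reference, the System R-style estimate described above can be written as follows, where d(.) denotes the number of distinct values in a join column:

```latex
\[
  \mathrm{sel}(R.a = S.b) \;=\; \frac{1}{\max\bigl(d(R.a),\, d(S.b)\bigr)}
\]
```

As a hypothetical illustration of the Date bias (my numbers, not from any particular system): a Date dimension pre-populated with 100 years of dates has roughly 36,500 distinct keys, while a young fact table may reference only ~90 of them. A predicate restricting Date to the current quarter (90 of ~36,500 rows) is then estimated, via this join selectivity, to keep only about 0.25% of the fact rows, when in reality it keeps essentially all of them.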

For example, suppose a fact table of Transactions has dimensions for the Products, Stores, and Dates of each transaction. Though current methods provide accurate selectivity estimates for predicates local to each dimension, e.g., ProductName = ‘Dockers’ and StoreName = ‘San Jose’ and Date = ’23-Feb-2013’, it is impossible to determine the effect of the intersection of these predicates on the fact table. Perhaps the San Jose store had a loss-leader sale on Dockers that day that expired the next day, and a similar sale on some other product the next day, so that the individual selectivities for each day, store, and product appear identical, but the actual sales of Dockers on the two days would be significantly different. It is the interaction of these predicates, through join predicates in this case, that research to date doesn’t address. This leads naturally to the final and most challenging problem in our trifecta.

Correlation of columns
With few exceptions ([Chri 83], [Vand 86]), query optimizers since [Seli 79] have modeled selectivities as probabilities that a predicate on any given row will be satisfied, then multiplied these individual selectivities together. Effectively, this assumes that Prob{ColX = Value1, ColY = Value2} = Prob{ColX = Value1} * Prob{ColY = Value2}, i.e., that the two predicates are probabilistically independent, if you recall your college probability. Fortunately for the database industry, this assumption is often valid. However, occasionally it is not.

My favorite example, which occurred in a customer database, is Make = ‘Honda’ and Model = ‘Accord’. To simplify somewhat, suppose there are 10 Makes and 100 Models. Then the independence (and uniformity) assumption gives us a selectivity of 1/10 * 1/100 = 0.001. But since only Honda makes Accords, by trademark law, the real selectivity is 0.01. So we will under-estimate the cardinality by an order of magnitude. Such optimistic errors are much worse than pessimistic over-estimation errors, because they cause the optimizer to think that certain operations will be cheaper than they really are, causing nasty surprises at run time. The only way to avoid such errors is for the database administrator to be aware of the semantic relationship (a functional dependency, in this case) between those two columns and its consequences, and to collect column group statistics, as DB2 and other database products now allow.
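
Spelling out the arithmetic with the simplified numbers above:

```latex
\[
  \mathrm{sel}_{\text{independence}}
    = \Pr\{\text{Make}=\text{'Honda'}\}\cdot\Pr\{\text{Model}=\text{'Accord'}\}
    = \tfrac{1}{10}\cdot\tfrac{1}{100} = 0.001
\]
\[
  \mathrm{sel}_{\text{actual}}
    = \Pr\{\text{Model}=\text{'Accord'}\}
    = \tfrac{1}{100} = 0.01
\]
```

Under the simplifying assumption that each model belongs to exactly one make, a column-group statistic on (Make, Model) would record about 100 distinct value pairs rather than the 10 x 100 = 1,000 pairs implied by independence, so the optimizer could estimate 1/100 directly.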

To identify these landmines in the schema automatically, Stillger et al. [Stil 01] developed the LEarning Optimizer (LEO), which opportunistically and automatically compared run-time actual cardinalities to optimizer estimates, to identify column combinations exhibiting such correlation errors. Ilyas et al. [Ilya 04] attacked the problem more pro-actively in CORDS (CORrelation Detection by Sampling), searching somewhat exhaustively for such correlations between any two columns in samples from the data before running any queries. And Markl and colleagues [Mark 05], [Mark 07] have made ground-breaking advances on a consistent way to combine the selectivities of conjuncts in partial results.

All great progress on this problem, but none yet solves the problem of redundant predicates that can be inadvertently introduced by the query writer, who typically believes that “more is better”, that providing more predicates helps the DBMS do its job better – it’s as American as apple pie! Let me illustrate with one of my favorite war stories.

At a meeting of the International DB2 User’s Group, a chief database administrator for a major U.S. insurance company whom I’d helped with occasional bad plans asked me to conduct a class on-site. I suggested it include an exercise on a real problem, unrehearsed. After my class, she obliged me by presenting two 1-inch stacks of paper, each detailing the EXPLAIN of a plan for a query. I feared I was going to embarrass myself and fail miserably under the gun. The queries differed in only one predicate, she said, but the original ran in seconds whereas the one with the extra predicate took over an hour.

I instinctively examined first the cardinality estimates for the results of the two, and the slower one had a cardinality estimate 7 orders of magnitude less than the fast one. When asked what column the extra predicate was on, my host explained that it was a composite key constructed of the first four letters of the policy-holder’s last name, the first and middle initials, the zip code, and last four digits of his/her Social Security Number. Did the original query have predicates on all those columns? Of course! And how many rows were there in the table? Ten million. Bingo! I explained that that predicate was completely redundant of the others, and its selectivity, 1/10^7, when multiplied by the others, underestimated the cardinality by 7 orders of magnitude, wreaking havoc with the plan. It took me maybe 5 minutes to figure this out, and I was immediately dubbed a “genius”, but it really was straightforward: the added predicate might help the run-time, especially if they had an index on that column, but it totally threw off the optimizer, which couldn’t possibly have detected that redundancy without LEO or CORDS.
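
The arithmetic behind that disaster is worth spelling out (numbers rounded to the example above):

```latex
% The four original predicates (name prefix, initials, zip, last 4 of SSN)
% together identify roughly one person in the 10^7-row table:
\[
  \mathrm{card}_{\text{correct}} \;\approx\; 10^7 \cdot \frac{1}{10^7} \;=\; 1 \text{ row}
\]
% Multiplying in the redundant composite-key predicate, itself with
% selectivity about 1/10^7 under independence:
\[
  \mathrm{card}_{\text{estimated}} \;\approx\; 10^7 \cdot \frac{1}{10^7} \cdot \frac{1}{10^7}
  \;=\; 10^{-7} \text{ rows}
\]
% an underestimate of seven orders of magnitude, exactly as observed.
```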

So c’mon, folks, let’s attack problems that really matter, those that account for optimizer disasters, and stop polishing the round ball.

Disclaimer: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

References

[Chri 83] S. Christodoulakis, “Estimating record selectivities”, Info. Systems 8,2 (1983) pp. 105-115.

[Grae 89a] G. Graefe, editor, Workshop on Database Query Optimization, CSE Technical Report 89-005, Oregon Graduate Center, Portland, OR, 30 May 1989.

[Grae 89b] G. Graefe, “The stability of query evaluation plans and dynamic query evaluation plans”, Proceedings of ACM-SIGMOD, Portland, OR (1989).

[Ioan 97] Y. Ioannidis et al., “Parametric query optimization”, VLDB Journal, 6(2), pp. 132 – 151, 1997.

[Ilya 04] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, A. Aboulnaga: CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. SIGMOD Conference 2004: 647-658.

[Mark 05] V. Markl, N. Megiddo, M. Kutsch, T. Minh Tran, P. J. Haas, U. Srivastava: Consistently Estimating the Selectivity of Conjuncts of Predicates. VLDB 2005: 373-384.

[Mark 07] V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, T. Minh Tran: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1): 55-76 (2007)

[Redd 05] N. Reddy and J. R. Haritsa. “Analyzing plan diagrams of database query optimizers”, VLDB, 2005.

[Seli 79] P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price, “Access path selection in a Relational Database Management System”, Procs. of ACM-SIGMOD (1979), pp. 23-34.

[Stil 01] M. Stillger, G. M. Lohman, V. Markl, M. Kandil: LEO – DB2’s LEarning Optimizer. VLDB 2001: 19-28.

[Stoy 08] J. Stoyanovich, K. A. Ross, J. Rao, W. Fan, V. Markl, G. Lohman: ReoptSMART: A Learning Query Plan Cache. Technical report.

[Vald 87] P. Valduriez, “Join indices”, ACM Trans. on Database Systems 12, 2 (June 1987), pp. 218-246.

[Vand 86] B.T. Vander Zanden, H.M. Taylor, and D. Bitton, “Estimating block accesses when attributes are correlated”, Procs. of the 12th Intl. Conf. on Very Large Data Bases (Kyoto, Sept. 1986), pp. 119-127.

Blogger’s Profile:
Dr. Guy M. Lohman is Manager of Disruptive Information Management Architectures in the Advanced Information Management Department at IBM Research Division’s Almaden Research Center in San Jose, California, where he has worked for 32 years. He currently manages the Blink research project, which contributed BLU Acceleration to DB2 for Linux, UNIX, and Windows (LUW) 10.5 (GA’d 2013) and the query engine of the IBM Smart Analytics Optimizer for DB2 for z/OS V1.1 and the Informix Warehouse Accelerator products (2007-2010). Dr. Lohman was the architect of the Query Optimizer of DB2 LUW and was responsible for its development from 1992 to 1997 (versions 2 – 5), as well as the invention and prototyping of Visual Explain, efficient sampling, the DB2 Index Advisor, and optimization of XQuery queries in DB2. Dr. Lohman was elected to the IBM Academy of Technology in 2002 and made an IBM Master Inventor in 2011. He was the General Chair for ACM’s 2013 Symposium on Cloud Computing and is the General Co-Chair of the 2015 IEEE International Conference on Data Engineering (ICDE). Previously, he was on the editorial boards of the “Very Large Data Bases Journal” and “Distributed and Parallel Databases”. He is the author of over 75 papers in the refereed academic literature, and has been awarded 39 U.S. patents. His current research interests involve disruptive machine architectures for Business Intelligence, advanced data analytics, query optimization, self-managing database systems, information management appliances, database compression, and autonomic problem determination.

Systems & Databases: Let’s Break Down the Walls

After hanging out exclusively with the database community for 35+ years, I’ve recently become more involved with the systems research community. I have a few observations and recommendations to share.

Much of the work published in systems conferences covers topics that would have a natural home in database conferences. For example, transactions and data streams are currently in vogue in the systems field. Here are some recent Best-Paper-award topics: data streams at SOSP 2013, data-parallel query processing at OSDI 2008, and transactions at OSDI 2012 and SOSP 2007. Although this topic overlap is increasing lately, it is not just a recent phenomenon. In the last 10+ years, the systems community has taken an interest in data replication, fault tolerance, key-value stores, data-parallel processing (i.e., map-reduce), query processing in sensor networks, and record caching. Some of this work has had a major impact on the database field, e.g., map-reduce and multi-master replication.

Yet despite this overlap of topics, there has been remarkably little attendance of database researchers at systems conferences or of systems researchers at database conferences. Not a good thing. And until 2013, I’ve been part of the problem.

To help break down this barrier, SIGMOD and SIGOPS have sponsored the annual Symposium on Cloud Computing (SoCC) since 2010. Personally, I’ve found attending SoCC’s to be time well spent, and I had a good experience attending ICDCS 2013 in July. So I decided to attend my first SOSP in October, 2013. It was great. The fraction of that program of direct interest to me was as large as any database conference, including work on transactions, data streams, replication, caching, fault tolerance, and scalability. I knew there wouldn’t be a huge number of database attendees, but I wasn’t prepared to be such an odd duck. At most a dozen attendees out of 600 were card-carrying members of the database community. With few database friends to hang out with, I got to meet a lot more new people than I would at a database event and hear about projects in my areas that I wouldn’t otherwise have known about. Plus, I could advertise my own work to another group of folks.

If you do systems-oriented database research, then how about picking one systems conference to attend this year? NSDI in Seattle in April, ICDCS in Madrid in late June, or OSDI in Colorado in October. And if you see a systems researcher looking lonely at a database conference, then strike up a conversation. If they feel welcome, perhaps they’ll tell others and we’ll see more systems attendees at database events.

Given the small overlap of attendees in systems and database conferences, we shouldn’t be surprised that the communities have developed different styles of research papers. Since systems papers typically describe mechanisms lower on the stack, they often focus on broader usage scenarios. They value the “ities” more than database papers, e.g., scalability, availability, manageability, and security. They favor simple ideas that work robustly over complex techniques that report improvements for some inputs. They expect a paper to work through the system details in a credible prototype, and to report on lessons learned that are more broadly applicable. They expect more micro-benchmarks that explain the source of performance behavior that’s observed, rather than just benchmarks that model usage scenarios, such as TPC.

From attending systems conferences, serving on SoCC program committees (PC’s), and submitting an ill-fated paper to a systems conference, I learned a few things about systems conferences that we database folks could learn from:

1. More of their conferences are single-track, and they leave more time for Q&A. This leads to more-polished presentations. When you’re presenting a paper to 600 attendees, doing it badly is a career-limiter.

I’ve always liked CIDR, not only because of its system-building orientation, but also because it’s single-track. I’m forced to attend sessions on topics I normally would ignore, and hence I learn more. I think we should try making one of our big conferences single-track: SIGMOD, VLDB or ICDE. One might argue that too many excellent papers are submitted, so it’s impractical. I disagree. Here’s one way to do it: Have the same acceptance rate as in the past, and all papers are published as usual. However, the PC selects only one-track’s worth of papers for presentation slots. The other papers are presented in poster sessions that have no competing parallel sessions. The single-track enables us to learn about more topics, the density of great presentations is higher, and the poster sessions enable us to dig deep on papers of interest to us (as some of our DB conferences already enable us to do). Unless our field stops growing, we really have to do something like this. Our conferences have already gotten out of hand with as many as seven parallel sessions.

2. When describing related work, systems papers cast a wider net. Their goal is often not simply to demonstrate that the paper’s contributions are novel, but also to educate the reader about lines of work that are loosely related to the paper’s topic. To help enable this breadth, they’ve recently settled on a formula where references don’t count toward the paper’s maximum page count. We should adopt this view in the database community. One self-serving benefit is that papers would cite more references, which would increase all of our citation counts.

3. There’s a widespread belief that systems PC’s write longer, more constructive reviews. I think DB PC’s have been improving in this respect, but on average we’re not up to the systems’ standard.

4. They usually shepherd all papers, to ensure that authors incorporate recommended changes. This also gives the authors someone to ask about ambiguous recommendations. We should do this too.

5. They typically require 10pt font. As a courtesy to those of us over a certain age, whose eyes aren’t what they used to be, we should do this, and increase the page count to maintain the same paper length.

6. At SOSP and OSDI, they post all slide decks and videos of all presentations. Obviously worthwhile. We sometimes post slide decks and videos only for plenaries. Let’s do better.

There are a few aspects of systems conferences that I’m less enthusiastic about.

1. Systems conferences favor live PC meetings. They do have benefits. It’s educational for PC members to hear discussions of papers they didn’t review and for young researchers to see how decisions are reached. Since all PC members hear every paper discussion, non-reviewers can bring up points that neutralize inappropriate criticisms or raise issues that escaped reviewers’ attention, which reduces the randomness of decisions. However, there are also negatives. In borderline cases, the most articulate, quick-thinking, and extroverted PC members have an inappropriate advantage in getting their way. And inevitably, some PC members can’t attend the meeting, due to a schedule conflict or the travel expense, so the papers they reviewed get short shrift.

On balance, I don’t think the decisions produced by live PC meetings are enough better than on-line discussions to be worth what they cost. Perhaps we can get some of the benefits of a PC meeting by allowing PC members to see the reviews and discussions of all papers for which they don’t have a conflict, late in the discussion period (which we did some years ago). With large PC’s, this might constitute public disclosure, in which case patents would have to be filed before submission, which will delay the publication of some work. But if we think a broader vetting of papers is beneficial, this might be worth trying.

2. During the reviewing process of a systems conference, if a paper seems borderline, then the PC chairs solicit more reviews. These reviews do offer a different perspective on the interest-value of the work. But they are usually not as detailed as the initial ones and may be by less-expert reviewers. It’s common to receive 5 or 6 reviews of a paper submitted to a systems conference. As an author it’s nice to get this feedback. And it might reduce the randomness of decisions. But it significantly increases the reviewing load on PC’s. In database PC’s, we rarely, if ever, do this. In my opinion, it’s usually better to hold reviewers’ feet to the fire and make them decide. Still, we’d benefit from getting 4 or 5 reviews more often than never, but not as often as systems conferences.

3. Systems conferences publish very few industry papers and rarely have an industry track. In my experience, an industry paper is judged just like a research paper, but with a somewhat lower quality bar. This is unfortunate. Researchers and practitioners benefit from reading about the functionality and internals of state-of-the-art products, even if they are only modestly innovative, especially if they are widely used. The systems community would serve their audience better by publishing more such papers.

4. The reviewing load for systems PC’s is nearly double that of DB PC’s, so systems PC’s are proportionally smaller. E.g., SOSP 2013 had 160 submissions and 28 PC members. If each paper had three reviews, that’s 17 reviews per PC member. With five reviews/paper, that’s 29 reviews per PC member. If DB conferences cranked up the reviewing load, then we’d probably participate in fewer PC’s, so the workload on each of us might be unchanged. A PC member would see more of the submissions in each area of expertise, which might help reduce the randomness of decisions … maybe. Overall, I don’t see a compelling argument to change, but it’s debatable.

5. Like some DB conferences, many systems conferences have adopted double-blind reviewing. I am not a fan. For papers describing a system project, I find it hard to review and, in some cases, impossible to write a paper for a double-blind conference. I have chosen not to submit some papers to double-blind conferences. I know I am not alone in this.

In summary, the systems and DB areas have become much closer in recent years. Each community has something to learn from the other area’s technical contributions and point-of-view and from its approach to conferences and publications. We really would benefit from talking to each other a lot more than we do.

Acknowledgments: My thanks to Gustavo Alonso, Surajit Chaudhuri, Sudipto Das, Sameh Elnikety, Mike Franklin, and Sergey Melnik for contributing some of the ideas in this blog post.

Blogger’s Profile:
Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. Over the past 35 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University and Wang Institute of Graduate Studies, and a VP Software at Sequoia Systems. During that time, he has published papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major focus of his research. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences and a member of the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.

Are Databases ready for Gestural Workloads?


In 2011, I was traveling on the Kolkata Metro when I made an interesting observation. Nearly every passenger around me was using a computing device of some sort. Some were playing with their smartphones, some with their tablets, some listening to music on their iPod Touches. Each was a consumer of structured data of some sort; interacting with their email, Facebook, or music catalog. Weeks later, while on a flight, this pattern came up again — every seat in the plane was equipped with a touch-driven entertainment system, some used while sifting through book pages on e-Readers. Clearly, the notion of a “computer” for the future generation looks very different from the past!

Over time, we seem to be converging to a vision where we are surrounded by computing devices of varying capacities, with users performing both simple and complex tasks, with a common attribute — these devices do not possess keyboards. In 2011, smartphones and tablets outsold workstations and portable computers by 1.5x. In 2013, this ratio is projected to reach 4x. Based on this trend, non-keyboard interaction (typically in the form of gestures) is on track to be the dominant mode of data interaction.


Gesture-driven applications pose a very different workload to underlying databases than traditional ones. Consider the idea of scrolling through a page of text, which is typically implemented as a SELECT / LIMIT query for each page, with page transitions implemented as successive queries. Most touch and gestural interfaces now implement inertial scrolling, where the speed of transitions can be superlinear. With a few successive swipes, the frontend could easily overload the query queue with successive SELECT / LIMIT queries, one for each page. Ironically, due to the high speed of scrolling, most of these results are moving too fast, and are hence ignored by the user. So, the question arises: Are today’s databases capable of handling gestural workloads? If not, what parts of the database should we fix? Should we devise a new query scheduler that can rapidly reprioritize / kill queries? (e.g., pages that the user has skipped no longer need to be queried or returned) Should we investigate bounded-latency execution? (e.g., the application needs results within Xms) Or should we change the query paradigm altogether? (e.g., query for a preview of results while scrolling is in motion, and then detailed results as the view stabilizes)
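
To make the scrolling problem concrete, here is a minimal sketch of a frontend-side scheduler that simply drops page-fetch requests the user has already scrolled past. It assumes a hypothetical messages table and uses SQLite for brevity; it illustrates the reprioritize/kill idea, not any particular system’s implementation.

```python
# Illustrative sketch (hypothetical schema, not from any real system): drop
# SELECT/LIMIT page queries made stale by fast inertial scrolling.
import sqlite3
from collections import deque

PAGE_SIZE = 50

class PageFetcher:
    def __init__(self, conn):
        self.conn = conn
        self.pending = deque()              # pages requested by scroll events

    def request_page(self, page_no):
        self.pending.append(page_no)        # cheap: just queue the request

    def fetch_visible(self, visible_page):
        """Run the SELECT/LIMIT query only for the page still on screen."""
        rows = []
        while self.pending:
            page_no = self.pending.popleft()
            if page_no != visible_page:
                continue                    # stale: the user flicked past it
            rows = self.conn.execute(
                "SELECT * FROM messages ORDER BY id LIMIT ? OFFSET ?",
                (PAGE_SIZE, page_no * PAGE_SIZE)).fetchall()
        return rows

# Tiny demo: an inertial swipe requests pages 3..9, but only page 9 is fetched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO messages (body) VALUES (?)",
                 [(f"row {i}",) for i in range(1000)])
fetcher = PageFetcher(conn)
for p in range(3, 10):
    fetcher.request_page(p)
print(len(fetcher.fetch_visible(9)))        # 50 rows, for the visible page only
```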

Another example of an interesting gestural interaction is that of the Interactive Join. As part of the GestureDB system we are developing in my group, the GestureQuery multitouch interface lets you compose two relations into an equijoin by simply dragging them together on the specific attributes:

In the interactive join, assuming all attributes are of the same data type, there could be M x N possible JOIN combinations, but only one is the correct intended query. We model this gesture as a query intent, a probability distribution over the space of possible queries (i.e., the M x N possible joins). Initially, the likelihood of each query is uniform, but as the user articulates the gesture, we can use the distance between attributes to compute the probability of each query.

The following interactive demo lets you play around with a real multitouch interaction trace from one of our user studies, collected from interactions with our GestureQuery iPad prototype [2]. The subject was asked to perform an equijoin of Album.ArtistId with Artist.ArtistId, dragging the two tables close to each other. Dragging the slider left to right lets you step through the recorded trace. Notice how rapidly the “Proximity Score” changes during the interaction.

This interaction is a great example of a query intent transition. As the gesture is being articulated, the distribution changes rapidly. The “Proximity Score” is the score for the join Album.ArtistId = Artist.ArtistId, the inverse of the distance between the two tables, which can be considered the likelihood for that intent; there is a score for each pair of attributes. Thus, providing a preview for the intended join would involve quickly executing and displaying results while the user is articulating the gesture, basing the priority on these scores.
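
As a rough illustration of how such proximity scores could be turned into a query-intent distribution, here is a small Python sketch. The function names and the simple inverse-distance normalization are my own simplification, not the GestureDB implementation.

```python
# Illustrative sketch (not the GestureDB API): turn attribute proximities
# into a probability distribution over candidate equijoins.
import math

def intent_distribution(attrs_left, attrs_right):
    """attrs_*: dict mapping attribute name -> (x, y) screen position.
    Returns {(left_attr, right_attr): probability} over all M x N joins."""
    scores = {}
    for a, (ax, ay) in attrs_left.items():
        for b, (bx, by) in attrs_right.items():
            dist = math.hypot(ax - bx, ay - by)
            scores[(a, b)] = 1.0 / (dist + 1e-6)   # proximity score
    total = sum(scores.values())
    return {pair: s / total for pair, s in scores.items()}

# Example: as Album is dragged toward Artist, the (ArtistId, ArtistId) pair's
# probability rises toward 1 while the other candidates fall away.
dist = intent_distribution(
    {"Album.AlbumId": (10, 10), "Album.ArtistId": (10, 40)},
    {"Artist.ArtistId": (15, 45), "Artist.Name": (15, 75)})
print(max(dist, key=dist.get))   # ('Album.ArtistId', 'Artist.ArtistId')
```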

Just as in the previous example, challenges arise here. Can a database provide join results at such low latency for such a diverse set of queries being fired at the same time, given such rapidly changing probabilities? It is not practical for the database to execute them all in parallel. Given that we know the domain of possible queries, is there scope for multiquery optimization? Second, given an interactive threshold of, say, 100ms for the results, is it feasible (or even useful) to generate full results for all queries? Should we consider partial query execution? Or can we simply sample the tables to provide interactive (albeit inaccurate) results? Third, feedback for a query is useful only for a fleeting moment during the gesture articulation. How could we show the parts of the result (i.e., rank the tuples) that are most useful?

Clearly, when working with gesture-driven interfaces and application layers, gestural workloads of database queries pose a smorgasbord of challenges that might require rethinking the entire database stack. As these interfaces evolve, the underlying database systems will have to keep up with changes in user expectations and demands. With the recent explosion of new and novel gestural interfaces, natural interfaces, and casual computing, it’s an exciting time for database research!

References:
1. Arnab Nandi: Querying Without Keyboards: CIDR 2013 (Vision Track)
2. Lilong Jiang, Arnab Nandi, Michael Mandel: GestureQuery: A Multitouch Database Query Interface: VLDB 2013 (demo)

Blogger’s Profile:
Arnab Nandi is an Assistant Professor of Computer Science & Engineering at The Ohio State University. Arnab’s research is in the area of database systems, focusing on challenges in big data analytics and data interaction. The goal of his group’s work is to allow end-users to explore and interact with semi-structured and structured data. Prior to joining Ohio State, Arnab did his Ph.D. at the University of Michigan, Ann Arbor, with stints at Google, Yahoo! Research, and Microsoft Research.

The Curses of Heterogeneity in Big Data


“Both theoretical and empirical research may be unnecessarily complicated by failure to recognize the effects of heterogeneity” – Vaupel & Yashin

Big Data is a daily topic of conversation among data analysts, with much said and written about its promises and pitfalls. The issue of heterogeneity, however, has received scant attention. This is unfortunate, since failing to take heterogeneity into account can easily derail the discoveries one makes using these data.

This issue, which some may recognize as an example of the ecological fallacy, first came to my attention via a paper elegantly titled “Heterogeneity’s Ruses: Some Surprising Effects of Selection on Population Dynamics” (Vaupel and Yashin, 1985). The authors discuss a variety of examples where the aggregated behavior of a heterogeneous population, composed of two homogeneous but differently behaving subpopulations, will differ from the behavior of any single individual. Consider the following example. It has been observed that the recidivism rate of convicts released from prison declines with time. A natural conclusion one may reach from this observation is that former convicts are less likely to commit crime as they age. However, this is false. In reality, there may be two groups of individuals, “reformed” and “incorrigible”, with constant – but different – recidivism rates. With time, there will be more “reformed” individuals left in the population, as the “incorrigibles” are sent back to prison, resulting in a decreasing recidivism rate for the population as a whole. This simple example shows that “the patterns observed [at the population level] may be surprisingly different from the underlying patterns on the individual level.” Researchers interested in uncovering these individual patterns, perhaps to help develop or test theories or to make predictions, might benefit from an “understanding of heterogeneity’s ruses” (Vaupel & Yashin).
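
A ten-line simulation makes the ruse easy to see; the population sizes and rates below are made up purely for illustration.

```python
# Minimal simulation of the recidivism example: two subpopulations with
# constant (but different) re-offense rates yield a declining aggregate rate.
reformed, incorrigible = 1000.0, 1000.0      # individuals still free
P_REFORMED, P_INCORRIGIBLE = 0.01, 0.20      # constant per-month recidivism rates

for month in range(1, 13):
    returned = reformed * P_REFORMED + incorrigible * P_INCORRIGIBLE
    population = reformed + incorrigible
    print(f"month {month:2d}: aggregate recidivism rate = {returned / population:.3f}")
    # The "incorrigible" group is depleted faster, so the population mix shifts
    # toward the "reformed", and the aggregate rate falls month after month
    # even though no individual's rate ever changes.
    reformed -= reformed * P_REFORMED
    incorrigible -= incorrigible * P_INCORRIGIBLE
```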

My colleagues and I have been tricked by heterogeneity time and again. As one example, our study of information spread on the follower graphs of Twitter and Digg revealed behavior surprisingly different from the simple epidemics that are often used to model information spread. In a simple epidemic, described, for example, by an independent cascade model, the probability of infection increases monotonically with the number of exposures to infected friends. This probability is measured by the exposure response function. The figure below shows the exposure response function we measured on Twitter: the probability of becoming infected by (i.e., retweeting) a piece of information (a URL) on Twitter as a function of how many friends had previously tweeted it. In contrast to epidemics, it appears as though repeated exposure to information suppresses infection probability. We measured an even more pronounced suppression of infection on Digg [Ver Steeg et al, 2011], and a similar exposure response was observed for hashtag adoption following friends’ use of them [Romero et al, 2011].

Figure: Exposure response on Twitter. Probability a user will retweet a message containing a URL after a number of friends have tweeted about it. Retweet probability is averaged over all users.

It is easy to draw wrong conclusions from this finding. In “What stops social epidemics?” [Ver Steeg et al, 2011], we reported that information spread on Digg is quickly extinguished, and attributed this to the exposure response function. We speculated that initial exposures “inoculate” users to information, so that they will not become infected (i.e., propagate it) despite multiple exposures. Now we know this explanation was completely wrong.

Figure: Exposure response of subpopulations of Twitter users, differentiated according to the number of friends they follow.

The exposure response function, when aggregated over all users, does not describe the behavior of any individual Digg or Twitter user, not even a hypothetical “typical” user. In fact, there is no “typical” Twitter (or Digg) user. Twitter users are extremely heterogeneous. Separating them into more homogeneous sub-populations reveals a more regular pattern. Figure 2 shows the exposure response function for different populations of Twitter users, separated according to the number of friends they follow (large fluctuations are the result of small sample size). Why the number of friends? This is explained in more detail in our papers [Hodas & Lerman 2012, 2013], but in short, we found it useful to separate users according to their cognitive load, i.e., the volume of information they receive, which is (on average) proportional to the number of friends they follow [Hodas et al, 2013]. Now the probability that a user within each population will become infected increases monotonically with the number of infected friends, very similar to the predictions of the independent cascade model.
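
For readers who want to try this on their own data, here is a sketch of the stratified aggregation behind a plot like Figure 2. The event format, field names, and bin edges are hypothetical, not those used in the papers cited above.

```python
# Sketch of the stratified aggregation behind Figure 2 (field names hypothetical):
# estimate P(retweet | k exposures) separately for each friend-count bin.
from collections import defaultdict

def exposure_response_by_bin(events, bin_edges=(0, 50, 250, 1000, float("inf"))):
    """events: iterable of (n_friends, n_exposures, retweeted) tuples.
    Returns {bin_index: {n_exposures: retweet_probability}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [retweets, trials]
    for n_friends, n_exposures, retweeted in events:
        b = next(i for i in range(len(bin_edges) - 1)
                 if bin_edges[i] <= n_friends < bin_edges[i + 1])
        cell = counts[b][n_exposures]
        cell[0] += int(retweeted)   # how often this exposure count led to a retweet
        cell[1] += 1                # how often this exposure count occurred at all
    return {b: {k: r / t for k, (r, t) in per_k.items() if t > 0}
            for b, per_k in counts.items()}
```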

Figure 2 has a different, more significant interpretation, with consequences for information diffusion. It suggests that highly connected users, i.e., those who follow many others, are less susceptible to becoming infected. Their decreased susceptibility in fact explains Figure 1: as one moves to the right of the exposure response curve, only the better connected, and less sensitive, users contribute to that portion of the response. However, despite their reduced susceptibility, highly connected users respond positively to repeated exposures, like all other users. You do not inhibit response by repeatedly exposing people to information. Instead, the reason these users are less susceptible hinges on the human brain’s limited bandwidth. There are only so many tweets anyone can read; the more tweets you receive (on average, proportional to the number of friends you follow), the less likely you are to see – and retweet – any specific tweet. Had we not recognized heterogeneity, we would not have found this far more interesting explanation.

References

  • Vaupel, J. W. and Yashin, A. I. (1985). Heterogeneity’s ruses: some surprising effects of selection on population dynamics. The American Statistician, 39(3):176-185.
  • Hodas, N. O. and Lerman, K. (2013). The simple rules of social contagion.
  • Hodas, N. O., Kooti, F., and Lerman, K. (2013). Friendship paradox redux: Your friends are more interesting than you. In Proceedings of 7th International Conference on Weblogs and Social Media.
  • Ver Steeg, G., Ghosh, R., and Lerman, K. (2011). What stops social epidemics? In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media.
  • Romero, D. M., Meeder, B., and Kleinberg, J. (2011). Differences in the mechanics of information Diffusion Across topics: Idioms, political hashtags, and Complex Contagion on twitter. In Proceedings of World Wide Web Conference.
Blogger’s Profile:
Kristina Lerman is a Project Leader at the University of Southern California Information Sciences Institute and holds a joint appointment as a Research Associate Professor in the USC Computer Science Department. After a brief stint as a theoretical roboticist, she found her calling in blending together methods from physics, computer science and social science to address problems in social computing and social media analysis. She writes many papers that are greatly enjoyed by all of their twenty readers.

Observation: CSers do not value themselves, nor the results of their work?


The title I chose is triggered by the lack of concern shown by computer scientists at public disclosures that companies like Apple, Google, and HP pay amazingly low taxes. These companies are derided in the press [look here for references]. The critiques provide little background, and often just repeat assertions that it’s all perfectly legal. Why? Corporate taxes are levied on reported US income, and their income is a share of the sales of software and devices incorporating software. That share can be greatly reduced by showing that the rights to the intellectual property (IP), embodied in the products, are held offshore. All multinational computer enterprises can and actually do avoid much taxation by having transferred a large fraction of the rights to their IP to shell companies located in taxhavens, typically subsidiaries of their offshore operations. The true value of such transfers is hard to assess, especially if the IP transfers are made before the products are offered for sale.

Since the IP in their products comprises most of the value of the revenues, most of the profits from selling or exploiting software now flow into taxhavens. Other costs of production, such as assembling hardware, duplicating CDs, or shipping products over the Internet, are commodity operations, adding little value. At the same time, the actual intellectual property, and the employees that create and improve that IP, are praised, as in Apple’s label “Designed in California”, while its valuation is left to accountants, without input from the creators.

It is true that the value of intellectual capital, the creators and the IP they generate, is hard to measure. But being hard has not stopped innovators from trying to solve a problem, and the principle is simple: the financial value of any good is what people are willing to pay for it in the future, discounted to today.
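
In its standard discounted-cash-flow form, a textbook formula rather than anything specific to the valuations discussed here, that principle reads:

```latex
\[
  V_0 \;=\; \sum_{t=1}^{T} \frac{E[\mathit{CF}_t]}{(1+r)^{t}}
\]
```

where E[CF_t] is the expected future income attributable to the IP in year t and r is a discount rate reflecting risk. The hard part is estimating those expected cash flows, not the formula itself.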

The low financial value assessed to IP transfers to taxhavens mirrors what bookkeepers do. The reported book value of a company is obtained by adding up all visible assets and the financial capital resources of a company, both in the US and held offshore. Investors know better. Shareholders value the company at a market capitalization, the product of the share price and the number of shares in the market. Those market values are 3 to 6 times more than the book values of modern companies. To a company’s shareholders it may not matter that the financial capital and IP rights are held in offshore mailbox accounts, they care about future income. The total amount of capital held offshore by high-technology multinationals is now over 10% of the annual U.S. GDP. Some large companies hold enough that just the taxes they avoided could pay for a year of NSF and DARPA grants combined.

Many computer scientists believe that software – their output – should be a free and common good. But only a fraction of them work pro bono or for organizations supported by grants. The remainder have not convinced the companies they work for to forego income. Avoiding taxes is not only a US issue; taxes on IP-based income are avoided worldwide. Operations of multinationals in India or Ireland do not generate profits attributable to the intellectual capital employed locally either; those rights are also allocated to taxhaven accounts. For instance, the low 12.5% Irish tax rate is applied only to a modest commodity markup, say 5%, on products assembled there. Having good jobs also in countries that want to move up the economic ladder is, of course, a complementary and worthwhile objective.

Should computer professionals care about the exploitation of their work? Other creative disciplines are aware of the value and dispersion of their work: certainly architects, automobile designers, etc. Those professionals do not expect that the results of their work should be free. Nor do they expect that their employers should escape taxation when the outcome of their work is sold.

Computer scientists do lobby for better and wider education, for computers for every student, for research funding for their projects – all to be paid by taxes on everything and everybody except their products. Government initiatives to help innovation are now hard to motivate, since the models used by economists ignore the contributions of intellectual capital, while focusing on interest rates and incentives that were meaningful 50 years ago.

The SIGMOD blog is not intended for any ongoing discussions, but comments and feedback will be welcome, and indicate that some of us care.

Blogger’s Profile:
Gio Wiederhold is professor emeritus of Computer Science at Stanford University, with courtesy Appointments in Electrical Engineering and Medicine. He has over 400 publications in these fields. Gio holds a PhD from the University of California, San Francisco and an honorary D.Sc. from the National University of Ireland, Galway. He has been elected as a fellow of the ACM, the IEEE, and the ACMI.
Gio was born in Italy and lived in Germany and The Netherlands. He emigrated to the United States in 1958. He worked sixteen years in industry as a programmer, software designer, manager, and division director. He became a professor at Stanford in 1976, advising PhD students in several departments, integrating concepts from multiple disciplines. His students now have responsible positions in finance, academia, and industry. He currently teaches two courses at Stanford: Business on the Internet and Software Economics.
Breaks from Stanford include tours as visiting professor at IIT Kanpur in India, at EPFL in Switzerland, representing UNDP at projects in Asia, as a researcher at IBM Germany, and as program manager at DARPA of the US Department of Defense, where he initiated the Intelligent Integration of Information (I3) program. I3 supported innovative programs, such as the Digital Library, leading to several significant Internet applications and companies, including Google. For the last 10 years Gio consulted exclusively for the US Treasury on technical and business aspects of valuation of intellectual property. He recently published a book, Valuing Intellectual Capital, Multinationals and Taxhavens (series Management for Professionals, Springer Verlag, New York, August 2013), integrating technical, business, and economic concepts to allow consistent valuation of IP while demonstrating the problems due to inadequate valuation of rights to high-technology exports, as described in an earlier paper with Amar Gupta and Erich Neuhold: “Offshoring and Transfer of Intellectual Property”, Information Resources Management Journal (IRMJ), Vol. 23, No. 1, January-March 2010, pp. 74-93.