September 20, 2015
One of the key challenges you face in an industrial research lab is how to choose your projects. You want your projects to be interesting research but also to contribute to your company. As a junior researcher, you are typically choosing a project to join, while later in your career you are expected to come up with and lead your own projects. Regardless of your career stage, you have to make an educated decision.
It is common wisdom that you should not choose a project that a product team is likely to embark on in the short term (e.g., within a year). By the time you get any results, they will have done it already. They might not do it as well or as elegantly as you would, but that won't matter at that point. This advice can be challenging to follow in a fast-moving company like Google and its peers, where a third of the engineers have Ph.D.s. On the other hand, you also don't want to be too far ahead of your company's trajectory, so you need to choose projects that you can imagine applying once you have substantial results.
The best way to find good projects is to talk with people in different parts of your company, or directly with customers if that's a possibility. Such conversations also create the personal relationships that will prove invaluable when it comes time to apply your technology. Thinking ahead to the technology transfer is also critical. I use the term "technology transfer" because it is commonly used, but I strongly dislike it: the idea of a "transfer" implies a handoff from the developers of the technology to the ones who put it into practice, thereby excluding some of the most effective ways in which technology can make an impact.
Here, however, I want to emphasize a different piece of advice that guided most of my work at Google: play to your company's unique strengths. At Google, those strengths are our abundance of data and our eager users who are willing to try out new products. I will illustrate this with two examples.
It is a longstanding frustration in the database community that databases are too hard to use and most people who have data do not have the skills to set up and maintain a database application. Consequently, the majority of the world’s data is stored outside database systems. The vision underlying Google Fusion Tables (launched in June 2009) was to create a database system for the masses. Of course, we had to compromise on the features of the system, but the idea was to support several common data management workflows effectively and to enable a large number of data owners to enjoy the benefits of data management.
In doing so, we played to one of Google's strengths: an international community of enthusiastic (but demanding) users who are willing to experiment with new products and help us push them forward rapidly. Within 24 hours of launching Fusion Tables, users from over 100 countries had taken a look at the product, and the feedback started pouring in. We were able to validate our assumptions about which use cases were compelling and fine-tune them.
The main workflow that Fusion Tables supported effectively was going from raw data (a CSV file or a spreadsheet) to a meaningful visualization that could also be embedded in the user's own site. In doing so, we played to another of Google's strengths: our scalable mapping infrastructure. Visualizing data on maps turns out (not surprisingly at all) to apply to a broad range of domains. Fusion Tables was used in applications ranging from disaster relief (e.g., providing critical information to people in need after the 2011 Japan earthquake or before the NYC hurricanes), to being a favorite tool for journalists (to the extent that they ran tutorials on Fusion Tables at their own conferences), to visualizing the disparity between the locations of patients with Multiple Sclerosis and the locations of potential caregivers (thereby making it possible to channel appropriate resources to patients in need).
Figure 1: A visualization created using Google Fusion Tables after the 2011 Japan earthquake. The visualization shows which road segments were drivable after the earthquake and the tsunami. This and other visualizations got tens of millions of hits per day during times of need.
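To make the workflow concrete, here is a minimal sketch of the kind of raw-data-to-map pipeline that Fusion Tables automated for non-programmers. This is not the Fusion Tables API; it uses the open-source folium library, and the CSV file and its columns are hypothetical stand-ins for data like the road-status reports above.

```python
# Minimal sketch of the raw-data-to-map workflow that Fusion Tables
# automated. NOT the Fusion Tables API; the CSV file and its column
# names (latitude, longitude, status) are hypothetical.
import csv

import folium

# Start a map roughly centered on the region of interest.
m = folium.Map(location=[38.0, 140.0], zoom_start=7)

with open("road_reports.csv", newline="") as f:
    for row in csv.DictReader(f):
        folium.CircleMarker(
            location=[float(row["latitude"]), float(row["longitude"])],
            radius=3,
            color="green" if row["status"] == "open" else "red",
            popup=row["status"],
        ).add_to(m)

# Save to an HTML file that could be embedded in the user's own site.
m.save("road_map.html")
```

The point of Fusion Tables was that data owners got this end-to-end path, including an embeddable result, without writing any code at all.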
There was no technology transfer of Fusion Tables. We conceived, developed, launched and maintained the product completely within Google Research. Launching an external product from Google Research was tricky even in 2009, because contributions from research tend to enable or enhance existing products rather than launch as standalone offerings. Fusion Tables, however, fit squarely within Google's mission to organize the world's information and make it universally accessible, and there was no obvious home for it elsewhere in the company. The downside of keeping the product completely within Research was that it did not have the resources of a normal product, which slowed its growth. Still, witnessing how data can help people in need when it is delivered to them in a timely fashion made for an exhilarating experience!
The second example project is WebTables: studying the collection of HTML tables on the Web and serving them to Google users in response to queries. The project started purely from intellectual curiosity (which struck Mike Cafarella and me at the same time) and a belief that these tables must somehow be useful. The strength we played to in this case was that we could run one MapReduce over the Google index and obtain the initial corpus of HTML tables in a single afternoon! We could then inspect the query stream to estimate how many queries might be well served by HTML tables. Of course, we then spent the next few years trying to recognize the good tables, which constituted less than 1% of the corpus.
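For flavor, here is a minimal sketch, in Python with BeautifulSoup, of what the map phase of such a table-extraction job might look like. The filtering heuristics are my own illustrative stand-ins; the actual pipeline, and the classifiers that found the good sub-1% of tables, were far more sophisticated.

```python
# Illustrative map phase for extracting candidate relational tables
# from crawled HTML, in the spirit of the single MapReduce described
# above. The heuristics below are stand-ins, not Google's pipeline.
from bs4 import BeautifulSoup


def looks_relational(table) -> bool:
    """Cheap heuristics to skip layout tables: no nested tables,
    at least a few rows, and some header cells."""
    if table.find("table") is not None:  # nested table => likely layout
        return False
    if len(table.find_all("tr")) < 3:    # too small to be real data
        return False
    return table.find("th") is not None  # has header cells


def map_page(url: str, html: str):
    """Map phase: emit a (url, table_html) pair per candidate table."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        if looks_relational(table):
            yield url, str(table)
```

In a MapReduce setting, each (url, table) pair emitted here would feed a reduce phase that aggregates and deduplicates the corpus.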
Unlike Fusion Tables, where we published a paper only after the product was launched, we published papers about WebTables before making a search service available, since we thought we were making contributions of interest to the community. For a variety of reasons, it took almost four years from the inception of the project until we launched a search engine for tables, and two more years until we reached the holy grail: displaying rows from HTML tables on the Google search results page and within snippets. To reach this goal, the team had to guarantee very high precision in matching queries to HTML tables. Along the way, we discovered an unanticipated application of HTML tables as a research tool in Google Docs.
Here too, technology was not transferred in the typical sense of the word. The initial search engine was launched by a team within Google Research. About 10 minutes after we got approval to launch HTML tables on the main Google search result page, a Googler from Search Quality approached me and suggested that several of the researchers who were responsible for making WebTables possible join his team. The move turned out to be a great and timely opportunity for a few of them, and they have been driving the process within the Search team since then. HTML tables would never have seen the light of day otherwise.
WebTables also faced another common challenge: a research innovation has to fit into a collection of ideas that form a product or search service, and it must demonstrate distinct value in that context. In particular, while we were working on HTML tables, Google Search underwent one of its major shifts: it started answering queries from our Knowledge Graph (KG). The KG contains facts about well-known entities in the world (people, companies, locations), and it was geared to answer "head queries," i.e., the queries that occur most frequently in the query stream and can be answered by structured data. The promise of HTML tables was that they could find data to answer the long and very heavy tail of queries that the KG was unlikely to answer. However, as the KG effort expanded, the head became broader and broader, and every time the KG added another domain, it put another dent in the coverage of WebTables. Still, WebTables answers many queries that the KG cannot, and it pleases users worldwide millions of times a day.
I’d like to conclude by sharing a few pieces of advice.
No matter how your role evolves, try never to stop coding (or, if you cannot code, get involved in code reviews). You don't necessarily need to be writing code that is on the critical path of your team (in fact, you probably shouldn't), and you don't have to be coding quarter after quarter, but you should try to do some coding on a regular basis. I was inspired to code by two of my good friends, Stanford professors Pat Hanrahan and Dan Boneh, both undisputed leaders in their fields, who are constantly involved in some coding activity. At Google, I found opportunities to code in the very initial stages of building a product that later morphed into Fusion Tables, while exploring a new idea with WebTables, and when building the entire first version of Biperpedia, a system for mining the long tail of schema attributes. My coding activities were always productive: they either launched a new project or triggered a major change in a project's direction. They also enabled me to have discussions with my team members at a completely different level of detail (to everyone's enjoyment). In a world where coding practices and tools are changing so rapidly, coding often helped me stay on top of things. Most of all, I enjoyed these projects, and they reminded me of the fun that attracted me to our field many years ago!
Over the years, I realized there is a trait that correlates well with success on complex, data-heavy tasks: the ability to obsessively look at great amounts of data until you gain insights. As Ph.D.s in Computer Science, we're not necessarily trained to spend hours looking at data. We're much better at obsessively tweaking an algorithm to get it to do the right thing, but we often get bored after looking at a handful of example data items (we database management folks are probably even worse than average!). At Google we have the gift of data, but it surprised me to see how often researchers and engineers would shy away from it. I have found time and time again that engineers and researchers who obsess over their data end up discovering the critical observations that enable them to develop effective algorithms. Additionally, when you present an idea, you sound much more compelling when you have examples rolling off your tongue.
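There is no magic to the habit; the hard part is doing it regularly. As a trivial illustration, a sketch like the following (the file name is hypothetical) pulls a fresh random sample to read whenever a pipeline changes, using reservoir sampling so that it works on files too large to load into memory.

```python
# A small sketch of the habit: sample raw examples and actually read
# them. Reservoir sampling (Algorithm R) keeps memory constant even
# when the input file is huge. The file name is hypothetical.
import random


def sample_lines(path: str, k: int = 50, seed: int = 0) -> list[str]:
    """Reservoir-sample k lines from a file of unknown size."""
    random.seed(seed)
    reservoir: list[str] = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = random.randrange(i + 1)  # uniform over lines seen so far
                if j < k:
                    reservoir[j] = line
    return reservoir


for line in sample_lines("query_table_matches.tsv"):
    print(line.rstrip())
```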
As a former academic, I'm often asked about the difference between academia and industry. There are many answers, including the fact that in industry you need to solve the entire problem rather than cherry-pick the nugget you wish to solve elegantly. The main difference, however, which can be very subtle and perhaps only obvious in hindsight, is the following. (I originally heard this observation from Prabhakar Raghavan before joining Google, but it took me a decade to internalize it.) In academia, as you pursue your passion for science and technology, the evaluation of faculty creates pressure to further your career as an individual, whether through publications, graduating and promoting your students, or being an excellent educator. You are by necessity the champion of your own ideas, which I think is ultimately a critical ingredient of scientific progress. In contrast, to be successful in industry, whether in engineering or in research, you often need to put your individual goals and ideas aside. You need to find the most effective way to get to a great product or service (or part thereof), even if that means finding the right mix of ideas. You will be rewarded for contributing key ideas to a product, but you will be rewarded even more for getting the job done, collaborating effectively across teams, and pleasing customers. This advice may be even more relevant for academics founding startups: once you have founded a company based on the great research you conducted at the university, the company takes on a life of its own, and the most important goal is to create a product that users want.
Much has been written about finding work-life balance. My two cents are simple: you do not reach balance by reducing work; you reach balance by finding a passion that draws you out of work. Of course, family comes first, but we often need some other passion as well. In my case, I had the most wonderful experience writing a book about coffee. I did not plan it as a work-life balance remedy; I only realized that it was one in hindsight. With the exception of spending time with my kids, I found that the evenings spent researching and writing about coffee (not to mention the exotic trips I had to take as part of this learning) gave me hours of respite from thinking about work.
No matter where you are or what you do, the most important thing is working with great people. I cannot possibly thank everyone I worked with at Google, but I'll mention a few. Jayant Madhavan, Hector Gonzalez and Anno Langen were the force behind the early days of Google Fusion Tables. Mike Cafarella started the WebTables project and, while he was an intern, led an army of interns to build the first version. Cong Yu and Boulos Harb turned the initial prototype of WebTables into an externally facing service and drove the creation and maintenance of the table corpus. Jayant Madhavan, Hongrae Lee, Sree Balakrishnan, Cong Yu, Boulos Harb and Afshin Rostamizadeh worked together to get HTML tables onto the Google search results pages. Jayant Madhavan is my canonical example of a person who can obsess over data and glean incredible value from it. I've also been lucky to work with Luna Dong, Fei Wu, Chung Wu, David Ko, Alkis Polyzotis, Chris Olston, Natasha Noy, Rod McChesney, Sudip Roy, Steven Whang and Xiao Yu; three awesome sabbatical visitors, Christian Jensen, Zack Ives and Sunita Sarawagi; and over 40(!) summer interns.
Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the Database Group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He is the author of the book "The Infinite Emotions of Coffee", published in 2011, and serves on the board of the Alliance for Coffee Excellence. He is also a co-author of the book "Principles of Data Integration", published in 2012. Dr. Halevy received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor's degree from the Hebrew University in Jerusalem.
Copyright © 2015, Alon Halevy. All rights reserved.