June 25, 2013
Like all computer science and engineering departments, at Arizona State University (ASU), we regularly re-assess the impacts of our research and educational programs based on feedback from users of the technologies we develop and companies that hire our graduates.
Last January, as part of the most recent of these efforts, we (in collaboration with IBM) organized a Data Services and Analytics Roundtable at ASU to discuss the challenges, approaches, and needs of the industry with respect to data services and analytics. The roundtable brought together executives, researchers, and engineers representing various industry sectors (including big data producers, consumers, and technology and service providers). Participants included representatives from the aerospace and defense industry (e.g. AeroJet), local government (e.g. City of Phoenix), technology representatives that regularly handle big data sets (e.g. Avnet, GoDaddy, and IO Data Centers), and data solution developers (e.g. IBM and Oracle).
To help contextualize the day’s discussions and to establish a baseline understanding of the participants’ view of the state of the art, as the very first task of the day we presented the participants with 35 data services and analytics knowledge competency areas we had identified in advance and asked them to rate these areas in terms of:
– current and future relevance to the business,
– their (and their competitors’) current and desired states in these areas, and
– whether they think there is need for further research and technological investments.
We also asked the participants whether they are currently involved in technology development in any of these areas (the list of knowledge competency areas we considered is included below).
Needless to say, the list of knowledge competency areas we presented to the participants was unavoidably imperfect and incomplete. Moreover, the discussions and results that emerged from the participants’ ratings reflected the biases and perceptions of the specific group of industry representatives who participated in the roundtable. Nevertheless, I think the list of the six most critical knowledge competency groups (in terms of the value gap – i.e., the difference between the current and desired states of the knowledge area) that emerged from these ratings is quite interesting.
Six most critical knowledge competency groups
First of all, while we were expecting to see temporal and spatial data analyses somewhere in the short list, it was rather surprising to find these two at the very top. Given that spatial and temporal data analyses are probably two of the most well-studied areas, with long histories of innovation, it was interesting to see that the participants at this roundtable thought that the state of the art is still well short of where it should be by now.
It was also interesting that, for this group of representatives, “performance” and “scalability” (if not the need for “real-time” processing of data in motion) were somewhat less critical than other key issues, such as summarization, cleaning, visualization, and integration. It looked like the message we were being given was “yes, the data scale is increasingly critical — but not just because it is too difficult to process all these data; rather, it is primarily because users are still not satisfied with the tools that are available to make sense out of the data.”
After this, the roundtable continued with sessions where groups of participants further discussed the challenges and opportunities in data services and analytics. Below is a very brief summary of the outcomes that emerged from these discussion sessions:
The amount of data being generated is massive. This necessitates new data architectures, with ample processing power and tools (including collaboration tools) that can match the scale of the data and support split-second decision making, through data fusion and integration as well as analysis and forecasting algorithms, to help non-data-experts (both government and commercial) make decisions and generate value.
While batch processing techniques are acceptable for “data at rest”, for many business needs the architectures have to support online and pipelined data processing to match the speed of the application and the business. In particular, real-time processing of temporally and spatially distributed observations over “data in motion” is needed by many applications, including those that involve human behavior modeling at individual and population scales. Credit card authentication, fraud detection in financial services, retail, mobile/location-aware services, and healthcare are just a few examples of application domains with this critical need. Tools that support entity analytics, (social and other) network analytics, text analytics, and media analytics are in increasing demand, not only for traditional applications such as monitoring and security, but also for emerging applications, including interest detection for retail/advertisement, social media, and SmartTV. Privacy and security are also key requirements for many of these applications.
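To make the “data in motion” requirement concrete, here is a minimal sketch of the kind of incremental, per-event check a real-time fraud detection pipeline might run over a stream of card charges. The threshold rule, window size, and data are purely illustrative assumptions, not a description of any production system.

```python
from collections import deque

def spending_anomalies(amounts, window=5, factor=3.0):
    """Flag transactions exceeding `factor` times the mean of the
    preceding `window` transactions -- a toy stand-in for the kind of
    real-time check run over data in motion, one event at a time."""
    recent = deque(maxlen=window)   # bounded state: only the last `window` values
    flagged = []
    for i, amount in enumerate(amounts):
        if len(recent) == window:
            mean = sum(recent) / window
            if amount > factor * mean:
                flagged.append(i)   # would trigger an alert in a live pipeline
        recent.append(amount)
    return flagged

# A stream of card charges; the spike at index 5 is flagged.
charges = [20.0, 25.0, 19.0, 22.0, 24.0, 400.0, 21.0]
print(spending_anomalies(charges))  # [5]
```

The key property, in contrast to batch processing, is that each event is handled as it arrives using a small amount of bounded state, so the check keeps pace with the stream.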
Federated data storage, analysis, and modeling are still critical for many businesses. Given that many government agencies and businesses opt to deploy common IT infrastructures, and that most incoming data is unstructured (or at least differently structured), frameworks that can make unstructured data queryable, prioritize and rank data, correlate and identify gaps in the data, highlight what is normal and what is not, and automate the ingest of data into these common infrastructures can also have significant impact. These data architectures also need to support reducing the size or dimensionality of the data to make it amenable to analysis. Novel learning algorithms need to be able to account for known models, but also to adapt those models to newly emerging patterns. Underlying systems should also support going back in history to validate models and going forward into the future to support forecasting and if-then hypothesis testing.
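One standard technique behind the “reducing the size or dimensionality of the data” requirement above is random projection, which maps high-dimensional records into a much smaller space while approximately preserving pairwise distances. The sketch below is a pure-Python illustration under assumed dimensions and data; in practice one would reach for a library implementation.

```python
import random

def random_projection(vectors, k, seed=42):
    """Project d-dimensional vectors down to k dimensions using a random
    Gaussian matrix (in the spirit of the Johnson-Lindenstrauss lemma),
    making large data sets more amenable to downstream analysis."""
    d = len(vectors[0])
    rng = random.Random(seed)
    # One random direction per output dimension, scaled by 1/sqrt(k).
    matrix = [[rng.gauss(0, 1) / k ** 0.5 for _ in range(d)] for _ in range(k)]
    return [
        [sum(r_j * v_j for r_j, v_j in zip(row, vec)) for row in matrix]
        for vec in vectors
    ]

# Reduce 100-dimensional records to 10 dimensions.
data = [[float(i + j) for j in range(100)] for i in range(4)]
reduced = random_projection(data, k=10)
print(len(reduced), len(reduced[0]))  # 4 10
```

Because the projection is data-independent, it can be applied record by record at ingest time, which fits the automated-ingest pipelines discussed above.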
The above clearly necessitates a new crop of data scientists with solid algorithmic and mathematical background, complemented with excellent data management, programming, and system development/integration skills. Data visualization and media information extraction are also important technical skills. These data scientists should also be able to identify data that is important, restructure data to make it useful, interpret data, formulate observation strategies and relevant data queries, and ask new questions based on the observations and results – including “what happened?”, “why did it happen?”, and “what happens next?”.
Increasingly, no single data management and analysis technology is able to address the needs of all applications; moreover, technologies (both on the application side and in the data management space) evolve rapidly. Therefore, a program educating these data scientists needs to shy away from teaching one DBMS or one data management paradigm (e.g. row-stores, column-stores) and instead teach the fundamentals that let graduates pick and deploy the appropriate DBMS, with the suitable structured or unstructured data model, for the particular task and business needs. The program should not only teach the currently dominant technologies, but also introduce up-and-coming as well as legacy technologies, and provide students with the skills necessary to keep up with technology development. In particular, the program should aim to bridge the ever-increasing knowledge gap caused by the proliferation of data management, processing, and analysis systems (both commercial and open-source) and help data scientists make informed architectural decisions based on a good understanding of how available technologies differ from and complement each other.
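The row-store/column-store distinction mentioned above is exactly the kind of fundamental such a program could teach with a toy example like the following: the same table laid out two ways, with the access pattern determining which layout wins. The table and data are hypothetical.

```python
# The same three-record table, stored two ways.
rows = [  # row store: each record is stored contiguously
    {"id": 1, "region": "west", "sales": 100},
    {"id": 2, "region": "east", "sales": 250},
    {"id": 3, "region": "west", "sales": 175},
]
columns = {  # column store: each attribute is stored contiguously
    "id": [1, 2, 3],
    "region": ["west", "east", "west"],
    "sales": [100, 250, 175],
}

# An analytic query (sum one attribute) touches only the 'sales' column
# in the column layout, but every full record in the row layout.
print(sum(r["sales"] for r in rows))  # 525
print(sum(columns["sales"]))          # 525

# A point lookup (fetch one whole record) favors the row layout,
# where the record is already assembled in one place.
print(rows[1])  # {'id': 2, 'region': 'east', 'sales': 250}
```

The answer is the same either way; what differs is how much data each layout must scan, which is the understanding a graduate needs to match a DBMS to a workload.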
A key skill for these new data scientists, beyond their technical excellence, is an overall awareness of how businesses operate and appreciation for the impact of the data on the businesses. Moreover, to maximize impact, these data scientists need to have the necessary skills to communicate with non-technical co-workers, including business executives. Such a program needs to be inherently multi-disciplinary and include research, case studies, and representatives from the industry and government to provide a holistic perspective in research and teaching and have direct local and national impact.
Many of these observations are, of course, not surprising to those who are well attuned to the needs of the industry and other user groups. Nevertheless, it was quite instructive that very consistent outcomes emerged from participant groups with diverse backgrounds. It was also interesting (from both educational and technological points of view) that participants highlighted that the non-technical skills of data scientists and engineers, such as “collaboration” and “technical communication (including with non-technical co-workers)”, are almost as critical as their technical skills. After all, many organizations are as complex, and the usage contexts of the data as diverse, as the data themselves. Moreover, in a complex and layered organization, one person’s learned result is nothing but another’s raw data. Thus, an individual’s or group’s ability to make sense out of data may not be all that useful for a business if that individual or group does not also have the necessary skills and supporting tools to contextualize and communicate the learned results to others in the organization.
Naturally, the outcomes of this roundtable will serve as a guide as we continue to develop our data management and analytics research and educational programs at ASU.
However, equally naturally, we are aware that no one single roundtable or workshop can unearth all the challenges and opportunities in a domain as diverse and fast developing as data management, services, and analytics. Thus, I (and I am sure many others following this blog) would love to hear your views and opinions.
K. Selçuk Candan is a Professor of Computer Science and Engineering at the Arizona State University. His primary research interests are in the area of efficient and scalable management of heterogeneous (such as sensed and streaming media, web and social media, scientific, and enterprise) data. He has published over 160 journal and peer-reviewed conference articles, one book, and 16 book chapters. He has 9 patents. He served as an associate editor of the Very Large Databases (VLDB) journal.
He is also on the editorial board of the IEEE Transactions on Multimedia and the Journal of Multimedia. He has served in the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06. In 2008, he served as a PC Chair for ACM Multimedia (MM’08). In 2010, he served as a program committee group leader for ACM SIGMOD’10 and a PC Chair for the ACM Int. Conf. on Image and Video Retrieval’10. In 2011, he served as a general co-chair for the ACM MM’11 conference. In 2012, he served as a general co-chair for ACM SIGMOD’12. He has successfully served as the PI or co-PI of numerous grants, including from the National Science Foundation, Air Force Office of Research, Army Research Office, Mellon Foundation, and HP Labs. He also served as a Visiting Research Scientist at NEC Laboratories America for over 10 years. He is a member of the Executive Committee of ACM SIGMOD and an ACM Distinguished Scientist.
Copyright © 2013, K. Selçuk Candan. All rights reserved.
Thanks for the interesting post! I teach a standard database course: ER design, SQL, RDBMS, optimization, indexing, etc. The same course is taught in most universities. My question is, what should I be teaching? And what textbook emphasizes spatio-temporal analyses, real-time stream processing, and data warehousing? I’d like to push these topics into an introductory DB course to develop “data scientists” but students tend to complain about topics not covered by the course textbook. Or would you see this as a separate course, in addition to a standard database course that uses white papers and on-line sources to cover the topics?