Sihem Amer-Yahia
Sihem Amer-Yahia

Quality Experiments Crowdsourced


Crowdsourcing is a powerful paradigm that democratizes the creation, collection, analysis, curation, and dissemination of data, contributed by millions of individuals, who are called workers. Early crowdsourcing platforms in the context of citizen reporting and citizen sciences were dedicated to specific tasks. Today, many other platforms that expect workers with different expertise exist. Among them, we can find Turkit, Mob4hire, uTest, Freelancer, eLance, oDesk, Guru, Topcoder, Trada, 99design, Innocentive, CloudCrowd, and CloudFlower. The most popular one is Amazon Mechanical Turk (AMT) that is generic and allows any requester to post virtually any kind of task in which any worker can participate via self-appointment.

Tasks can be as simple as binary requests such as recognizing a landmark in a city, or more sophisticated ones such as meta-data enrichment (picture and video tagging in Flickr or YouTube, food description in Open Food Facts), opinion solicitation (restaurants reviews), collaborative intelligence (city map alignment in, identifying stars in GalaxyZoo, character recognition in Recaptcha), and knowledge-intensive crowdsourcing with tasks such as collaborative editing (Wikipedia) and idea generation. The database community has been offering its own platforms (e.g., Qurk on top of AMT and Deco for query processing) and today, scientists are resorting to crowdsourcing for a particular kind of task, that is, quality experiments to validate their findings.

There are several challenges one faces in crowd-based quality experiments. Some are inherent to using a crowd. For example, worker filtering to make sure only qualified ones participate, has to be done by task requesters. Those qualification tests are often difficult to design and do not always lead to flawless experiments. Data collection to build worker profiles is also a challenge (e.g., workers are asked to rate at least 30 movies to build a profile). Other challenges depend on the actual platform. The availability of a large worker pool and the ability to easily set up virtually any task, make AMT the platform of choice for most research experiments. In fact, AMT is used by most researchers in the DB community, and tasks range from answering simple binary questions such as determining whether a tweet expresses a positive or a negative sentiment, to more sophisticated tasks that require some domain knowledge such as collaborative journalism. However, the closed nature of AMT (and of all crowdsourcing platforms we know of today) forces task requesters to find workarounds outside the system in order to select the right pool of workers and assign to them the right set of tasks (AMT is often used as a hiring platform only and workers are redirected to another website). Moreover, monetary compensation, as is the case in AMT, is not always the best choice for incentivizing workers (think open-source contributions) and tends to attract more cheaters when the reward is too high. To address that, post-quality control is enforced via majority voting or by monitoring task completion time. The bottom line is that in a volatile environment, enforcing some level of correctness is difficult, time-consuming and costly.

I have used AMT in several qualitative analyses. The first experiment was in 2009 [1] to evaluate travel itineraries extracted from Flickr against carefully handcrafted ones. It was a large-scale experiment “to validate the wisdom of the crowd”. About 450 workers participated in the experiment and evaluated itineraries in 5 popular and geographically distributed cities (San Francisco, New York City, London, Paris and Barcelona). Half of the workers were involved in an independent study that evaluated the usefulness of our itineraries, the points of interests they contained and the landmark visit and transit times. The other half participated in a comparative study where our itineraries were put side-by-side with handcrafted ones and were preferred in a large majority of the cases. It worked so well that the Wall Street Journal, a prominent newspaper in the US, talked about it [2]. In this experiment, special attention was dedicated to the design of qualification tests to filter out workers. We had to think carefully about a test for which answers are not readily available with simple Web search. Below is one of the tests we used.


The second time we used AMT [3] relied on input from 100 workers to evaluate how well different rating aggregation functions (least misery function, majority voting, pairwise disagreement, and global average) to compute movie recommendations from user groups with different characteristics (small/large groups, similar/dissimilar groups). We found several insightful and sometimes surprising results including that least misery, i.e., optimizing for the lowest rating in a group, performed better than averaging ratings and is the best aggregation function for small groups formed by similar users. Pairwise disagreement on the other hand, did very well with large groups of dissimilar users. In the first round of that experiment, we had inconsistent findings. We then looked carefully in AMT logs and realized that there were cheaters whose task completion time was unrealistic.

The latest work we used AMT for was crowdsourcing the validation of articles produced collaboratively by a crowd to another crowd [4]. Using the crowd to validate the crowd is in fact a common practice and has shown to work well in many applications such as sentence translation by non-experts. Participating workers were instructed to join a group and together write a short summary of current world events on some topic. The qualification test included answering factual questions on those events. Of course, cheaters could use their favorite search engine to find answers to the qualification test.

I hope that the examples above illustrate both the advantage of having a generic platform that enables reaching out to a large worker pool but also showcases the challenges of using crowds for qualitative analysis. Using crowds bears similarities to running surveys and to using cohorts in medical drug trials. While there are practices on how to select such cohorts and ensure result quality (e.g., controlled/test groups, independently chosen subjects, double blind tests in medical trials), none exist for running a crowd-based quality experiments.

With my colleagues Beatrice Valeri and Shady El Bassuoni, we examined 4395 papers published in SIGMOD, PVLDB, ICDE, EDBT, ICDM, KDD, ICWSM, RecSys, and Hypertext between 2010 and 2014 (2014 statistics are partial) and found that 60 of them used crowd-based quality experiments. Among those, all of them use AMT or CrowdFlower, a layer on top of AMT. Most tasks are about collecting golden data used in evaluation and among those, labeling data is the most common. 48 out of the 60 papers have at least one author in the USA. European authors contributed to a total of 14 papers followed by Israel (6 papers), China (4 papers), Qatar and Canada (3 papers each), Singapore and Japan (2 papers each). As shown below, the number of publications that crowdsource experiments has been increasing with a worker base ranging from around 20 to 500 per experiment.



Most papers use workers’ acceptance rate (recorded in AMT based on previous tasks completed by those workers) and only a few use a thorough qualification test. Those tests are followed by a post-processing step where majority voting, manual checking (by involving another crowd), task completion time, and answer justification, are used to check workers’ input. There goes design efficiency.



What is our opportunity today?
Crowd-based quality experiments in the database community are recent. In addition, research in crowdsourcing is gaining traction in our community. What could we do to make a difference? We need a crowd, a platform and some wisdom. Lab experiments have historically been conducted with a small number of colleagues and students who are physically present in a lab. We need to get them online. We also need a platform with returning workers that can be profiled, a platform where we can conduct crowd-based quality experiments but also test our own research on crowdsourcing: worker profiling strategies, our task-to-worker assignment algorithms. That would reduce the burden on all of us when reporting quality experiments on papers since it would be enough to point to results summarized on a single website (double-blind reviewing could be enforced). Such a platform would need to be generic in order to allow any kind of task and any worker to register, open in order to deploy our own crowd-based algorithms and advance the state of research in crowdsourcing, and free!

I do not expect our community to suddenly change the course of things. I myself still run experiments on AMT and will continue to do it. Of course, there are questions related to where to start and how to maintain the platform. Crowd4U ( is a good place to start. It’s an all-academic generic platform that is being developed at the Univ. of Tsukuba and that is open to the rest of the academic world. Here in Grenoble, we ran a campaign on campus where we advertised crowdsourcing and gave out cookies and drinks (not totally free!) and passers-by performed over 1600 tasks to label tweets in less than 3 hours.

You can start your own platform or join Crowd4U but keep it open! In [4], we argue that human factors such as worker skills, worker availability, expected wage, and ability to work together, could be accounted for to make better use of the crowd. Such parameters can only be tested in an open and generic crowdsourcing system where task assignment and skill learning algorithms can be deployed for virtually any task.

So let’s take this opportunity, have some wisdom, and build and nurture our own crowdsourcing platform(s).

[1] Munmun De Choudhury, Moran Feldman, Sihem Amer-Yahia, Nadav Golbandi, Ronny Lempel, Cong Yu: Automatic construction of travel itineraries using social breadcrumbs. HT 2010: 35-44


[3] Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, Cong Yu: Space efficiency in group recommendation. VLDB J. 19(6): 877-900 (2010)

[4] Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das: Optimization in Knowledge-Intensive Crowdsourcing. CoRR abs/1401.1302 (2014)

Blogger’s Profile:
Sihem Amer-Yahia is a 1st class CNRS (Centre National de Recherche Scientifique) Research Director at LIG (Laboratoire d’Informatique de Grenoble) in France. Sihem heads the SLIDE group (ScaLable Information Discovery and Exploitation) that sits at the intersection of large-scale data management and Web data analytics with an emphasis on the social and the semantic Web. Until July 2012, Sihem was Principal Scientist at the Qatar Computing Research Institute (QCRI) where she led a group in Social Computing and worked with local Universities on student mentoring and with Al Jazeera Online on news traffic analytics. From 2006 to 2011, she was Senior Scientist at Yahoo! Research and worked on revisiting relevance models and scalable top-k processing algorithms for Delicious, Yahoo! Travel and Personals, and Flickr. Before that, she spent 7 years at at&t Labs in New Jersey, working on XML query optimization and XML full-text search in conjunction with the W3C. Sihem has served on the SIGMOD Executive Board, is a member of the VLDB and the EDBT Endowments. She serves on the editorial boards of ACM TODS, the VLDB Journal and the Information Systems Journal. She was track chair of SIGIR 2013 and of PVLDB 2013. She is PC chair of EDBT 2014 and will be PC chair of BDA 2015 (French DB conference) and PC co-chair of SIGMOD Industrial 2015. Sihem received her Ph.D. in Computer Science from Paris-Orsay and INRIA in 1999, and her engineering degree from ESI/INI, Algeria.

Copyright @ 2014, Sihem Amer-Yahia, All rights reserved.