Perspectives on the Evaluation of Recommender Systems Workshop at ACM Recommender Systems 2021


Evaluation is essential when conducting rigorous research in recommender systems (RS). It may span anything from the evaluation of early ideas and approaches to elaborate systems in operation, and it may target a wide spectrum of aspects. Naturally, we do (and have to) take various perspectives on the evaluation of RS. The term “perspective” may, for instance, refer to the various purposes of an RS, the various stakeholders affected by an RS, or the potential risks that ought to be minimized. Further, we have to consider that different methodological approaches and experimental designs represent different perspectives on evaluation. A perspective on the evaluation of RS may also be substantially shaped by the available resources: access to resources will likely differ between, say, PhD students and established researchers in industry.

Acknowledging that there are various perspectives on the evaluation of RS, we want to put up for discussion whether there is a “golden standard” for the evaluation of RS, and, if so, whether it is indeed “golden” in any sense. We postulate that the various perspectives are valid and reasonable, and we reach out to the community to discuss and reason about them.

The goal of the workshop is to capture the current state of evaluation and to gauge whether there is, or should be, a different target that RS evaluation should strive for. The workshop addresses the question of where we, as a community, should go from here, and aims to come up with concrete steps for action.

We are particularly committed to inviting and integrating researchers at the beginning of their careers, and equally to integrating established researchers and practitioners, from industry and academia alike. It is our particular concern to give a voice to the various perspectives involved.

Call for Papers

Topics of interest include, but are not limited to, the following:

  • Case studies of difficult, hard-to-evaluate scenarios
  • Evaluations with contradicting results
  • Showcasing (structural) problems in RS evaluation
  • Integration of offline and online experiments
  • Multi-stakeholder evaluation
  • Divergence between evaluation goals and what is actually captured by the evaluation
  • Nontrivial and unexpected experiences from practitioners

We deliberately solicit papers reporting problems and (negative) experiences regarding RS evaluation, as we consider reflection on unsuccessful, inadequate, or insufficient evaluations a fruitful source for yet another perspective on RS evaluation, one that can spark discussions at the workshop. Accordingly, submissions may also address the following themes:

(a) “lessons learned” from the successful application of RS evaluation or from “post mortem” analyses describing specific evaluation strategies that failed to uncover decisive elements,
(b) “overview papers” analyzing patterns of challenges or obstacles to evaluation,
(c) “solution papers” presenting solutions for specific evaluation scenarios, and
(d) “visionary papers” discussing novel and future evaluation aspects.


We solicit two forms of contributions. First, we solicit paper submissions that will undergo peer review. Accepted papers will be published and presented at the workshop. Second, we offer the opportunity to present ideas without a paper submission. In this case, we call for the submission of abstracts that will be reviewed by the workshop organizers. Accepted abstracts will be presented at the workshop, but not published.

Paper Submissions

We solicit papers of 4 to 10 pages (excluding references). In line with this year’s call for papers of the main conference, we do not distinguish between full and short (or position) papers. Papers should be formatted in the new ACM single-column format, following the official templates.

Submitted papers must not be under review in any other conference, workshop, or journal at the time of submission. Papers should be submitted through the workshop’s EasyChair page.

Submissions will undergo single-blind peer review by a minimum of three program committee members and will be selected based on quality, novelty, clarity, and relevance. Authors of accepted papers will be invited to present their work during the workshop, and accepted papers will be published as open-access workshop proceedings. At least one author of each accepted paper must attend the workshop and present the work.

Abstract Submissions

We solicit abstracts of 200–350 words, to be submitted through the workshop’s EasyChair page.

The workshop organizers will select abstracts based on quality, clarity, relevance, and their potential to spark interesting discussion during the workshop. Authors of accepted abstracts will be invited to present their work during the workshop.

Important Dates

  • Paper submission deadline: July 29th, 2021 (extended: August 5th, 2021)
  • Author notification: August 21st, 2021
  • Camera-ready version deadline: September 6th, 2021
  • Workshop (virtual): September 25th, 2021, 15:00-18:00 CEST
  • Workshop (additional on-site meeting): September 30th, 2021, 11:30-13:00 CEST


Please watch the videos of the accepted papers before the workshop takes place; the workshop itself will focus on discussion.

Saturday, September 25th, 2021, 15:00-18:00 CEST (online)

15:00-15:10 Welcome
15:10-15:45 Keynote: “Recommender system evaluation: One gold standard, but no silver bullets” by Zeno Gantner, Zalando
15:45-16:00 Break
16:00-16:45 Discussions in break-out rooms
16:45-17:15 Break
17:15-17:30 Evaluating Recommenders with Distributions (Michael D. Ekstrand, Ben Carterette, Fernando Diaz)
17:30-18:00 General discussions

Times are in CEST (Amsterdam local time).

Thursday, September 30th, 2021, 11:30-13:00 CEST (on-site, Berlage Zaal)

We will have an on-site meeting where we’ll (informally) discuss open issues and problems regarding RecSys evaluation.


Keynote: “Recommender system evaluation: One gold standard, but no silver bullets”

Zeno Gantner, Zalando

In the field of recommender systems, we have a large and diverse set of evaluation methods at our disposal for both academic research and industrial applications. Randomized controlled trials in the form of online A/B tests are widely accepted for data-driven decision making, but because of their cost in terms of time and effort they cannot support every single decision. We need different methods for different scenarios. I will present a case study on controlling the effects of different levels of exploration between control and treatment groups in an online A/B test. Besides that, I will also talk about so-called diff-testing, a set of methods that allows estimating the impact of a change without relying on annotations or user feedback. Diff-testing is not covered much in the literature, but it can be a valuable addition to a practitioner’s toolkit of evaluation methods.

Zeno Gantner is a principal applied scientist at Zalando, responsible for the area of fashion recommendations. He has more than 10 years of industry experience implementing, running, and improving ML-based production services for millions of users. His publications on diverse AI topics, such as applied machine learning, knowledge representation, reasoning, and recommender systems, have been cited more than 6,000 times according to Google Scholar. Zeno has contributed to more than a dozen Free Software/Open Source projects and has been a contributor to the online encyclopedia Wikipedia since 2002. The first ACM RecSys conference he participated in was RecSys 2009 in New York.

Accepted Contributions

All Teaser Videos on a single page.
Extended abstract about the workshop as part of the RecSys proceedings.
Proceedings (ceur-ws).


Coupled or Decoupled Evaluation for Group Recommendation Methods?
Ladislav Peška and Ladislav Maleček

Evaluating recommender systems with and for children: towards a multi-perspective framework
Emilia Gómez, Vicky Charisi and Stephane Chaudron

MOCHI: an Offline Evaluation Framework for Educational Recommendations
Chunpai Wang, Shaghayegh Sahebi and Peter Brusilovsky

Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context
Milena Filipovic, Blagoj Mitrevski, Diego Antognini, Emma Lejal Glaude, Boi Faltings and Claudiu Musat

On Evaluating Session-Based Recommendation with Implicit Feedback
Fernando Diaz

Prediction Accuracy and Autonomy
Anton Angwald, Kalle Areskoug and Alan Said

Recommender systems meet species distribution modelling
Indrė Žliobaitė

Sequence or Pseudo-Sequence? An Analysis of Sequential Recommendation Datasets
Daniel Woolridge, Sean Wilner and Madeleine Glick

Statistical Inference: The Missing Piece of RecSys Experiment Reliability Discourse
Ngozi Ihemelandu and Michael Ekstrand

Time-dependent Evaluation of Recommender Systems (Best Paper Award)
Teresa Scheidt and Joeran Beel

Toward Benchmarking Group Explanations: Evaluating the Effect of Aggregation Strategies versus Explanation
Francesco Barile, Shabnam Najafian, Tim Draws, Oana Inel, Alisa Rieger, Rishav Hada and Nava Tintarev

Unboxing the Algorithm with Understandability: On Algorithmic Experience in Music Recommender Systems
Anna Marie Schröder and Maliheh Ghajargar


Evaluating Recommenders with Distributions
Michael D. Ekstrand, Ben Carterette and Fernando Diaz

Program Committee

  • Vito Walter Anelli (Politecnico di Bari, Italy)
  • Alejandro Bellogin (Universidad Autónoma de Madrid, Spain)
  • Toine Bogers (Aalborg University Copenhagen, Denmark)
  • Amra Delić (TU Wien, Austria)
  • Linus W. Dietz (Technical University of Munich, Germany)
  • Michael Ekstrand (Boise State University, USA)
  • Mehdi Elahi (University of Bergen, Norway)
  • Maurizio Ferrari Dacrema (Politecnico di Milano, Italy)
  • Andrés Ferraro (Universitat Pompeu Fabra, Spain)
  • Hanna Hauptmann (Utrecht University, The Netherlands)
  • Dietmar Jannach (AAU Klagenfurt, Austria)
  • Mesut Kaya (Aalborg University Copenhagen, Denmark)
  • Jaehun Kim (Delft University of Technology, The Netherlands)
  • Bart Knijnenburg (Clemson University, USA)
  • Pigi Kouki (Relational AI)
  • Dominik Kowald (Know-Center Graz, Austria)
  • Sandy Manolios (Delft University of Technology, The Netherlands)
  • Ashlee Milton (Boise State University, USA)
  • Maria Soledad Pera (Boise State University, USA)
  • Jessie Smith (University of Colorado, Boulder, USA)
  • Marko Tkalčič (University of Primorska, Slovenia)
  • Martijn C. Willemsen (Eindhoven University of Technology, The Netherlands)
  • Markus Zanker (Free University of Bozen-Bolzano, Italy)