So many research papers to assess! (#1258)

Topics/tags: Overcommitment, Academia, data, unedited.

Disclaimer: This piece, more than most, represents unedited, free-flowing musing.

A little more than a year ago, while I was deep in the serenity of leave, some colleagues encouraged me to volunteer to serve as one of the three program co-chairs for the 2024 and 2025 SIGCSE Technical Symposia on Computer Science Education, the flagship conference of the Association for Computing Machinery’s Special Interest Group on Computer Science Education. I did so. And my application was accepted.

It seemed like a good idea at the time [1].

As I noted, life seemed calm on leave. I also realized that I get a sense of meaning from teaching; it’s perhaps the primary way I contribute to the world. Without teaching, I was feeling less meaningful, and this kind of scholarly service [2] provided another sense of meaning.

Life has gotten much more complicated since then. It also doesn’t help that I’m significantly revising one course this semester and have made too many changes to another course, one that I haven’t taught in about five years [3]. But I’m sticking with my commitment.

We are now in the midst of deciding what papers to accept to the conference. We had over 700 intended submissions [4]. We ended up with over 650 papers; most approach the conference limits of six pages (10 pt, two column, single spaced) plus a page of references. Various constraints suggest that we can accept about 200 of them. As you might expect, there’s no way the three Program Co-Chairs can read all 650+ papers. And, in any case, we shouldn’t be making the decisions based only on our own opinions. So we recruited a large cohort of reviewers and a somewhat smaller cohort of Associate Program Chairs (APCs) who manage the reviewers and followup discussion for around eight papers each.

About two weeks ago, we went through and ensured that every paper had at least three reviews [5]. If one didn’t, we found additional reviewers. Our APCs then ran discussions amongst the reviewers of each paper and wrote meta-reviews that summarized the individual reviews and the discussions.

As I said, we are now in the midst of deciding what papers to accept. Perhaps it’s time to do some math.

If each of the three program chairs read all the reviews, discussion, and metareviews for each paper, we’d probably spend at least ten minutes per paper, and that doesn’t even count the time talking to each other about them. With more than 650 papers, that would be 6500 minutes, or 108 hours. Each. In a week-and-a-half timeframe. While we have full-time jobs.

Did I mention that this is volunteer work, and we are unpaid? [6]

As you might expect, we look for ways to cut down on the work. We split the papers up, with each of us handling about 1/3 of the papers. It doesn’t quite work that way; some non-trivial number of papers need further discussion. Perhaps the meta-review is brief. Perhaps the APC makes a different recommendation than the reviewers. Perhaps one of the reviewers has had a very different response than the others. Perhaps the APC or one of the reviewers has added semi-confidential notes to the Program Chairs and we need to consider them. I’d guess that puts us at about 300 papers each. Is that better?

300 papers at 10 minutes each is still 3000 minutes or 50 hours. That doesn’t count our need to discuss, to send email, to address issues folks have raised. I know that we spent thirty minutes debating some papers that had mixed reviews.
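For anyone who wants to check the workload arithmetic, here is a quick Python sketch. The function name `review_hours` is made up for illustration, and the ten-minutes-per-paper figure is the rough guess from the text, not a measurement.

```python
def review_hours(papers: int, minutes_per_paper: float = 10) -> float:
    """Hours one chair spends reading, ignoring discussion and email."""
    return papers * minutes_per_paper / 60


# Each chair reads everything: 650 papers at about ten minutes apiece.
all_papers = review_hours(650)

# Rough per-chair load after splitting the papers and sharing the hard ones.
split_load = review_hours(300)

print(f"{all_papers:.1f} hours each if every chair reads all 650 papers")
print(f"{split_load:.1f} hours each at about 300 papers")
```

Changing `minutes_per_paper` shows how quickly the total grows once a few papers take thirty minutes rather than ten.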

So we look for other ways to make ourselves more efficient.

I mentioned this issue to a colleague in another department. Their first response was “Why aren’t you paid to do this?” Their next was “Can’t you just accept the top 25%, reject the bottom 50%, and focus on the remainder?” [7]

In short, no.

For example, consider a paper that one reviewer rated as a Clear Accept (5, Content, presentation, and writing meet professional norms; improvements may be advisable but acceptable as is), two rated as a Marginal Tend to Accept (4, Content has merit, but accuracy, clarity, completeness, and/or writing should and could be improved in time), and one rated as a Probable Reject (2, Basic flaws in content or presentation or very poorly written). The average score is 3.75, putting it in the bottom 50% [8]. However, it may be that the reviewer who gave it a Probable Reject has a problematic review. Perhaps they have critiqued a potentially irrelevant point (e.g., complaining about sample size for a paper based on detailed interviews). Perhaps they haven’t really explained their rating. Perhaps we find that they give low scores to all papers. If we discount their score, the average rating rises to a bit above 4.3, putting the paper in the top 25%.
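The effect of discounting one score is easy to reproduce. The sketch below uses the four ratings from the example; the helper `mean_without` is a hypothetical name, and dropping a score outright is just one simple way to "discount" it.

```python
from statistics import mean

# Ratings from the example: one Clear Accept (5), two Marginal Tend to
# Accept (4), and one Probable Reject (2).
ratings = [5, 4, 4, 2]


def mean_without(scores, discounted):
    """Average after dropping one occurrence of a discounted score."""
    kept = list(scores)
    kept.remove(discounted)
    return mean(kept)


with_all = mean(ratings)
without_reject = mean_without(ratings, 2)

print(f"All four reviews:        {with_all:.2f}")
print(f"Without Probable Reject: {without_reject:.2f}")
```

One review out of four moves the average by more than half a point, which is enough to shift a paper from one pile to another.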

On the other hand, perhaps we had three Clear Accepts and one Probable Reject. Perhaps the outlying reviewer has noted a significant flaw that the other reviewers missed. In that case, we should reject the paper, even if the score might suggest otherwise.

Then there’s the whole issue of turning qualitative ratings into numeric averages. It’s problematic, at best. But you have to start somewhere.

In addition, some reviews are bad. We have some responsibility to ensure that authors get appropriate feedback. And sometimes the bad reviews are also outliers that bias the average. A 6 (candidate for best paper) and three 3’s has the same overall average as two 5’s, a 4, and a 1. So we can’t treat them the same. Perhaps I said that already.
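A small Python sketch makes the point concrete. It compares the two sets of scores mentioned above; the variable names are invented for illustration, and the median is only one of many ways to see that the sets differ.

```python
from statistics import mean, median

# A 6 and three 3's versus two 5's, a 4, and a 1.
one_champion = [6, 3, 3, 3]
one_detractor = [5, 5, 4, 1]

for label, scores in [("one champion", one_champion),
                      ("one detractor", one_detractor)]:
    print(f"{label}: mean {mean(scores):.2f}, median {median(scores):.1f}, "
          f"range {min(scores)} to {max(scores)}")
```

Both sets average 3.75, but the medians (3.0 and 4.5) tell different stories: in the first, one enthusiastic reviewer pulls the average up; in the second, one harsh reviewer pulls it down.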

Can we trust the APCs? In general, yes. However, we can’t completely follow the APCs’ recommendations, either. Why not? They’ve recommended accepting about 260 papers, about 25% more than we can accept. They’ve also encouraged us to look at another 55 or so papers, some of which should be accepted. That’s far beyond what’s possible.

In addition, some metareviews are bad. Authors deserve good metareviews. Or at least not-bad metareviews.

We also have a responsibility to consider the overall structure of the program. Perhaps we don’t need 50 papers about GitHub Copilot in CS1 [9], even if they all have high rankings. The community will be better served with a mixture of topics.

And so we need to pay some attention to all of the papers, even those with high scores and low scores. And even those that the APCs recommend we accept.

So how do we make the workload less onerous? [10]

Whenever possible, we focus on metareviews (and notes to the Program Committee) rather than the individual reviews. At least I do. That may mean that we (I) miss issues in the individual reviews. However, we need to cut time.

I’ve also been skimming the rest of each paper’s entry to look for individual reviews that are too brief. I believe my colleagues have, too [11]. However, in most cases the APCs have asked these reviewers to update their reviews and the reviewers have ignored the requests. Unfortunately, we have to accept that some authors will get some less-than-helpful reviews. When possible, we are taking note of these reviewers, and most will not be invited back in the future [12].

Have I mentioned that most reviewers and most APCs have done a wonderful job? I should. Unfortunately, people don’t remember the good reviews. They remember the bad ones. Reviewer #2 is always a force to be reckoned with [14].

Soon we’ll be sending out our recommendations. Then we have a week or two off and then we move on to round 2. Fortunately, we make fewer of the decisions for round 2.

Did I mention that I also have to catch up on my grading?

Postscript: I worry that we rely on word of mouth for deciding on how to address these issues. For example, I know that some program chairs have read only the metareviews, but that others consider it their responsibility to read all the reviews. There are also questions as to whether we follow a pure numeric scale or whether we can shape the program. I’d like to see us record procedures and guidelines.

Postscript: It may be that the SIGCSE Technical Symposium has grown enough that we need to find a new model. I have one in mind. I expect to muse about it in the next few weeks, when I find time.

[1] “It seemed like a good idea at the time” is also the name of a series of panels that have been presented at the SIGCSE Technical Symposium in various years.

[2] Yes, it’s both scholarship and service. Grinnell doesn’t seem to believe you can do both in the same activity, but you can.

[3] I don’t know about the rest of you, but I find that teaching a course again after five years makes it feel new.

[4] That is, authors sent us an abstract and promised to submit the paper soon thereafter.

[5] I recall a time when we had five or so reviews per paper. But there were significantly fewer submissions.

[6] We do receive complimentary registration to the conference and get our hotel rooms covered. I’m not sure whether or not our airfare is also covered. However, we also spend the conference working.

[7] Those may not be the numbers they used, but they give you a sense of the comment.

[8] If we accepted all the papers in which the average score was 3.5 or above, we’d end up with 400 accepted papers, about twice what we can accept.

[9] No, we did not get that many.

[10] Musing about it is not the answer. That also takes time.

[11] It sounds like my colleagues have done more detailed reading of reviews than I have. I’m not sure how they found the time.

[12] I’d like to encourage these reviewers to do better. Perhaps some will, with additional feedback. But the ones who’ve ignored requests to improve probably need to be dropped.

[14] Reviewer #2 is an academic meme.

Version 1.0 of 2023-10-01.

The opinions stated herein are those of Samuel A. Rebelsky and do not necessarily reflect those of Grinnell College, Grinnell's Computer Science Department, the Rebelsky family, CMD-IT, SIGCAS, SIGCSE, any other organizations I am or have been affiliated with, or even most other sentient beings.


SamR's Assorted Musings and Rants: So many research papers to assess! (#1258) by Samuel A. Rebelsky is licensed under a Creative Commons Attribution 4.0 International License.
