An abbreviated history of Grinnell’s end-of-course evaluations

Topics/tags: Miscellaneous, assessment, Grinnell, long

Grinnell is considering significant changes to our end-of-course-evaluation (EOCE [1]) system. Why? Well, the Faculty were told, approximately, If we don’t make a change, the system will break, and we will lose all of the historical data [2]. While it’s misleading to suggest that we will lose data [3], we do need to think about new approaches to gathering and analyzing data and, perhaps, revisit the uses and structure of Grinnell’s EOCE system.

We know that EOCEs are biased [5]. That is, there is a wealth of published studies that show bias in end-of-course evaluations, from different numeric ratings to very different word choice based on race, gender, and gender identity. The growth of online courses has even allowed researchers to eliminate many potentially confounding variables. For example, since students don’t see the professor, you can have a person teach two online sections and identify them as male in one section and female in the other [6,7]. Hence, I hope that considerations of bias will be central to any discussions we have of EOCEs.

I also hope that we’ll discontinue the optimistic myth that EOCEs can be used for both evaluation and development. As a faculty member, I want students to treat those situations very differently. If I’m using my EOCEs to improve my teaching, I’d like students to think primarily about things that need improvement. And if their focus is Things Sam needs to improve, they will naturally give lower numeric scores.

Given that it’s nearly twenty years since the Faculty agreed to impose the current set of EOCEs on ourselves, I thought it would be useful to put together a somewhat brief [8] history of end-of-course evaluations at Grinnell. At least that was the original plan for this piece. As you might expect, I could not resist adding a bit of commentary throughout [9].

When I came to Grinnell, each department had its own end-of-course evaluation forms. I see from the faculty meeting minutes that, before I came to Grinnell, the Faculty had agreed to impose such forms on all courses, but leave the particulars up to each department. If I recall correctly, the Math/CS form was mostly developmental, asking students about the strengths and weaknesses of the course. It was also almost exclusively qualitative, with only one quantitative question: How many hours per week did you spend on this class?

But there was a push for a change. Two committees on campus felt like they needed more information as they assessed faculty. The Personnel Committee felt like it needed more independent information on faculty. And the Faculty Budget Committee had been tasked with assigning each faculty member a merit score. They were struggling to assign a merit score that should be based primarily on teaching without a consistent way to assess teaching. Those challenges led those two committees to suggest a common end-of-course evaluation. Executive Council led the way.

A proposal was presented at the March 15, 1999, faculty meeting. That proposal was to conduct a preliminary test of a new, common, end-of-course evaluation form with only Likert-style [10] responses. Lee Sharpe suggested that we also ask students for textual comments, and the faculty approved that amendment [11]. That form was used at the end of the semester in Spring 1999. The Office of Institutional Research [12] conducted a study that suggested that the form used in the experiment appears to be reliable and without unreasonable biases. Unfortunately, the report was on the Web and, because Communications nuked the College’s Web site, it appears that the report is no longer available [14]. I see that the recommendation suggests that we are on the right track to developing a short evaluation form that provides some, albeit limited, information. Council suggested another two semesters of experiment with a slightly revised form.

It looks like Elizabeth Dobbs presented the next proposal on behalf of the Executive Council at the Faculty Meeting of November 1, 1999. That proposal was for another study, to be reviewed in 2000-2001. The motion read,

On behalf of the Executive Council, I move the following proposal: that Council conduct a two-semester study of an end-of-course evaluation form with revised statements, that raw quantitative and qualitative data from this study be reported only to the course instructor and that Office of Institutional Research, that the OIR report an analysis of the aggregate data to next year’s Council, and that Council then report to the faculty.

The faculty approved that proposal. One thing I recall from the discussion was a question about how different committees would deal with overlapping confidence intervals [15]. I’m pretty sure that one of the responses was While not all of us know what confidence intervals are, we can figure out the numbers.

A memo from Council to the Faculty dated November 16, 2000, summarizes the history of the effort and suggests that we adopt the consistent EOCE forms. It acknowledges the issues raised at an earlier faculty meeting.

Council also believes there is evidence that the quantitative data can be summarized in a form that will minimize the risk of invoking precise distinctions that are not supported by the data.

More importantly, the memo notes that,

We recommend that the data be used by individual faculty members, by the department chairs, and by the personnel committee specifically to identify outlying ratings, which help to focus efforts in faculty development and supplement other evidence concerning the effectiveness of particular areas of teaching. Council believes that there is also valuable information in the text comments that can be used for development purposes [16].

The resolution we were called to vote upon originally read as follows.

That the faculty adopt, effective Fall, 2000, the student end-of-course evaluation system described in the November 16, 2000 memorandum to the faculty from the Executive Council. It is intended that this system would be used universally and consistently across all faculty members and courses (with the exception of independent study, guided reading, individual music instruction, and similar courses) taught at Grinnell College.

Bill Ferguson led us in a discussion of the resolution at the faculty meeting of November 20, 2000. There was a lot of discussion, including an amendment to add the text “including the limits on the use of the information generated by those points” to the resolution.

We did not reach consensus on the motion at that faculty meeting and voted to table the motion. I see from the minutes that we did approve another semester of the same EOCEs while discussion continued.

Fortunately, discussion resumed at the Faculty Meeting of December 4, 2000. I see that there was a lot of consideration of additional, department- or course-specific forms and what role they would serve in reviews. There was also an effort to clarify what data could be used. It ended with us agreeing that one of the central statements should be that,

In the event of a review for reappointment, promotion or tenure, department chairs and the Personnel Committee will receive summaries of the quantitative data within the review period.

The memo (which, in effect, was part of the motion) follows that with,

Chairs should not attempt to summarize the entire body of text comments, but it is permissible to use text comments from student evaluations to clarify and interpret the numeric data.

If I recall correctly, the surrounding discussion indicated that only the numeric data [17] should be available to Personnel. Certainly, that’s what the text suggests and what the instructions that accompany the form say [18].

There was then further discussion about what data would be available to the Faculty Budget Committee. Rather than attempt to summarize the minutes, I will reproduce them here.

Bill talked of another factor that needs to be discussed, namely the 3rd to the last sentence in the last paragraph on page 3 of the End of Course Evaluation proposal: Summaries of the numeric data will also be available to the faculty Budget Committee. The current procedure of the Budget Committee is to use one year’s data and to use it in thinking about the teaching component of merit pay. After much discussion President Osgood asked faculty if they wanted to make a motion to strike this sentence. Motion was made and seconded. A member of the Budget Committee commented, I hope you will not deny the budget committee this very limited amount of information to go with the chair’s letter. A vote was taken on a motion to eliminate the sentence Summaries of the numeric data will also be available to the faculty Budget Committee. Motion carried.

That’s right; even though the Budget Committee explained that they needed the data, and Council recommended that they get those data [19], the Faculty did not consider such uses appropriate [20]. From my perspective, we made the right decision. I remember, a few years later, a member of Council making some comment on the order of Our EOCE scores are like Lake Wobegon; everyone is above average. But here’s the thing; it shouldn’t be surprising that we all get high scores on questions like The instructor helped me to understand the subject matter of the course and I learned a lot in the course. Grinnell, by and large, has very good (and often excellent) teachers. You would expect people to learn a lot and to benefit from their interactions with the faculty. The fact that someone would consider that a kind of Lake Wobegon effect is troubling.

Fast forward a few years. We had moved from annual merit reviews to triennial merit reviews. The Budget Committee still struggled with how to assess faculty. I was on Council [21]. I made two suggestions: First, that in assigning merit scores, the Faculty Budget Committee move from a five-point to a three-value scale for teaching [22,23]. Second, that the Faculty Budget Committee receive the distribution of responses on two questions (Interactions with the faculty member helped me learn the subject matter of the course and I learned a lot in this course) as well as the comparable set of responses across similar courses (100, 200, and 300-level courses in Science, Social Studies, Foreign Language, Arts, and whatever we called the other Humanities) so that we could run statistical tests to see if the courses were, in fact, high or low outliers.

The faculty discussed these ideas at the faculty meeting of September 17, 2007. Eliza Willis led the discussion. I am thankful to her for that; I don’t think I could have done so.

We prepared a motion for the meeting of October 1, 2007.

The Budget Committee for the faculty may receive the total number of responses and proportion of those responses in the agree and strongly agree categories for questions #2 and #6 on the end-of-course evaluations for each course taught by faculty members under review. For comparative purposes, the Budget Committee may receive appropriate aggregate data for the faculty as a whole.

In this motion, the intent of “may” was “is permitted to”. If I were writing the proposal now, I might change the second “may” to “must”. In any case, the faculty passed that motion.
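The two quantities the motion names, the total number of responses and the proportion in the agree categories, are straightforward to compute from per-category counts. What follows is a minimal sketch in Python, not the College's actual procedure. The function name `agree_summary`, the choice of which labels count as "agree and strongly agree" on the six-point form, and the counts themselves are all assumptions made for illustration.

```python
# Assumption: on the six-point form, "agree and strongly agree" is read here
# as the top two categories. The motion itself does not spell this out.
AGREE_LABELS = ("Strongly Agree", "Moderately Agree")


def agree_summary(counts, agree_labels=AGREE_LABELS):
    """Return (total responses, proportion of responses in agree_labels).

    counts maps a category label to the number of responses in it.
    The proportion is None when there are no responses at all.
    """
    total = sum(counts.values())
    if total == 0:
        return 0, None
    agree = sum(counts.get(label, 0) for label in agree_labels)
    return total, agree / total


# Invented counts for one question in one hypothetical course.
example_counts = {
    "Strongly Agree": 12,
    "Moderately Agree": 5,
    "Slightly Agree": 2,
    "Slightly Disagree": 1,
    "Moderately Disagree": 0,
    "Strongly Disagree": 0,
}

total, proportion = agree_summary(example_counts)
print(f"{total} responses, {proportion:.0%} in the agree categories")
```

Passing the labels as a parameter keeps the ambiguity visible: a different reading of "agree" changes one argument rather than the code.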

I’m pretty sure that that’s the last substantial discussion the Faculty have had about End-of-Course Evaluations. Where do we stand? Well, we still have common EOCEs; no one has (yet) proposed that we discontinue that practice. But there are also clear limits on what EOCE data committees can use: Personnel can get the summary score data. Personnel can only get the comments that the faculty member or their review chair choose to include. The Budget Committee receives less: only summary info on questions 2 and 6. The Budget Committee should also get comparative data and should use those data to identify statistical outliers. I don’t know whether they do.

In any case, nearly twenty years have passed. It’s almost certainly time for us to revisit EOCEs. The literature has grown and changed; I’m pretty sure most of the research about bias in EOCEs has come since 2000. There also seems to be greater understanding that assessment and development do not comfortably coexist. And, well, the system we have in place is dying. It may be inevitable that we go online. But if we do so, we should do so with a clear understanding of the bias at play in online evaluations (and evaluations in general) and with a full faculty discussion of the appropriate role for EOCEs in our processes.

Postscript: Because I’m somewhat anal retentive, I went to the College Archive to dig out the minutes of the Faculty Meetings from the 1999-2000 discussions of End-of-Course Evaluations. Our assistant archivist graciously scanned them for me. If, for some reason, you’d like a copy, just let me know, and I’ll send them along [24].

Appendix: For those who have not memorized Grinnell’s End-of-Course Evaluation form, or perhaps have never seen it, here are the current statements.

Q1: The course sessions were conducted in a manner that helped me to understand the subject matter of the course.

Q2: The instructor helped me to understand the subject matter of the course.

Q3: Work completed with and/or discussions with other students in this course helped me to understand the subject matter of the course.

Q4: The oral and written work, tests, and/or other assignments helped me to understand the subject matter of the course.

Q5: Required readings or other course materials helped me to understand the subject matter of the course.

Q6: I learned a lot in this course.

[1] I use EOCE. Almost everyone else uses EOC. I’m not sure why they drop the final E since the acronym refers to End-of-Course Evaluations, not just the end of a course.

[2] Given that the system is written in Microsoft Access and is almost twenty years old, it’s reasonably certain that the system will break.

[3] And, not so surprisingly, discussions founded on sensationalistic misstatements [4] don’t tend to be as successful or useful.

[4] I was tempted to write lies. But that seems excessive, and also doesn’t tend toward open discussion.

[5] By We, I mean anyone in the academy who has been paying attention. Unfortunately, it’s not everyone, and it does not necessarily include those making policies about EOCEs.

[6] I do not know of studies that explore what happens with people who do not identify on the gender binary. Nonetheless, given the other biasing effects we see, the initial assumption should be that EOCE scores will be biased against such people.

[7] I can’t find the study I’m thinking of right now. But I did find an article that looks at gender differences in the same online course.

Mitchell, K., & Martin, J. (2018). Gender Bias in Student Evaluations. PS: Political Science & Politics, 51(3), 648-652. doi:10.1017/S104909651800001X

There’s also a related article in Chronicle.

[8] This musing may be long. It could have been much longer, particularly if I had included more of the source texts.

[9] As you’ve already seen.

[10] Likert-style questions involve a statement like Interactions with my fellow students helped me learn along with a choice of levels of agreement. Grinnell uses a six-point Likert scale, from Strongly Disagree to Strongly Agree. An advantage of an even-sized Likert scale is that there is no option for a neutral response.

[11] Here we encounter the first reason that I think that every EOCE issue should go before the full faculty: Council only wanted numbers; it took someone outside of Council to suggest that there’s a value to text.

[12] That office has since been renamed the Office of Analytic Support and Institutional Research.

[14] Admittedly, I didn’t try that hard. I found an archive of the page. However, the report is in PDF, and it appears that the archive did not capture the PDF.

[15] Why do we have confidence intervals for categorical responses? Because we turn Likert-style responses (Strongly Agree, Moderately Agree, Slightly Agree, Slightly Disagree, Moderately Disagree, Strongly Disagree) into numbers (6, 5, 4, 3, 2, 1) and then average them. I don’t consider that appropriate. I’m not alone in that opinion; many statisticians also object to the treatment of categorical variables as continuous numeric variables. However, it appears that some statisticians consider this approach reasonable for Likert scales.
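To make the concern about averaging concrete, here is a minimal sketch in Python. The 6-to-1 coding comes from the note above. The two sections are invented data, not Grinnell results, and `summarize` is a hypothetical helper written only for this illustration.

```python
from statistics import mean

# Coding described in the note: Strongly Agree = 6 down to Strongly Disagree = 1.
LABELS = {
    6: "Strongly Agree",
    5: "Moderately Agree",
    4: "Slightly Agree",
    3: "Slightly Disagree",
    2: "Moderately Disagree",
    1: "Strongly Disagree",
}


def summarize(responses):
    """Return (mean of the coded responses, count of responses per label)."""
    counts = {LABELS[code]: responses.count(code)
              for code in sorted(LABELS, reverse=True)}
    return mean(responses), counts


# Two invented sections of ten students each (illustration only).
polarized = [6] * 5 + [1] * 5  # half strongly agree, half strongly disagree
lukewarm = [4] * 5 + [3] * 5   # everyone sits near the middle

for name, data in (("polarized", polarized), ("lukewarm", lukewarm)):
    average, counts = summarize(data)
    print(name, average, counts)
```

Both invented sections average 3.5, yet the per-category counts describe very different classes. That difference is exactly what a single averaged score hides.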

[16] As I’ve said, I’m less sure of that.

[17] I still think of Likert data as primarily qualitative, rather than quantitative. I am not alone. That does not mean that quantitative data can’t be extracted, such as the number of responses in each category. But that doesn’t seem to have been the intent.

[18] Instructors receive the completed, original forms only after grades have been submitted to the Registrar. Department chairs receive copies of the completed forms. These are the only people who may read your comments unless the faculty member or the department chair chooses to include anonymous quotes in documents prepared for the faculty member’s next review.

[19] Perhaps not so surprisingly, given that the Budget Committee is a subset of Council.

[20] That’s another reason I think that any changes to the EOCE system need to come before the full faculty.

[21] Believe it or not, but the Science division decided they were willing to let me serve as Division Chair for one term.

[22] I can’t recall whether I suggested 2/4/5, 1/3/5, or 1/2/3. I’m also not sure it matters.

[23] I suggested a three-point scale because I don’t believe it’s possible to make fine distinctions between the quality of teaching at Grinnell. Of course, I also don’t believe in merit raises, even though they benefit me.

[24] Wow, that sentence has a lot of commas.

Version 1.0 released 2018-10-22.

Version 1.2 of 2018-10-24.