Steven Volk (September 21, 2014)

A slight detour this week from the daily business of the semester to a look towards its end. This Article of the Week was spurred by an article which appeared in the *Chronicle of Higher Education* this past week. In “Scholars Take Aim at Student Evaluations’ ‘Air of Objectivity,’” Dan Berrett reports that a new examination of end-of-semester student evaluations has found that they “are often misused statistically and shed little light on the quality of teaching.” Other than that, they’re probably OK. (That’s just me being snarky, so disregard.) More seriously, the draft study by Philip B. Stark, a professor of statistics at UC Berkeley, and Richard Freishtat, senior consultant at Berkeley’s Center for Teaching and Learning, repeats some of the critiques that have been leveled against Student Evaluations of Teaching (SETs) for a long time and raises some new ones.

I can’t comment on the research design or reliability of these studies, and there are certainly arguments in favor of SETs, but the following findings have been reported over the years:

- Some research found a correlation between SET scores and students’ grade expectations, although revenge (“I’m going to get back at that teacher”) was not found to be an important element.
- Effectiveness scores (faculty rated highly for being “effective” teachers) and enjoyment scores are related.
- Students’ ratings of instructors can be predicted from the students’ reactions to 30 seconds of silent video of the instructor; first impressions may dictate end-of-course evaluation scores, which then represent little more than a snap judgment by students. [Pamela Ann Hayward, “Students’ Initial Impression of Teaching Effectiveness,” PhD dissertation, University of Illinois, Urbana-Champaign, 2000.]
- Gender, ethnicity, and the instructor’s age and physical attractiveness matter in SET ratings.
- SETs don’t tell us anything about the quality of the teaching.
- They can be used by some students to get back at faculty for not teaching the course that they, the students, wanted.
- SET scores correlate with the lecturer’s charisma and other factors unrelated to teaching.

The Berkeley authors add a new critique based on how the numbers generated by SETs are understood. For one thing, response rates can seriously skew the data. Has anyone else felt that same sinking feeling when, on the day you hand out teaching evaluations, the three students who you *knew* were loving the class were absent? Raise your hands. Non-responders are not the same as responders, and the higher the rate of non-response on the day student teaching evaluations are distributed, the less useful the final average becomes – and yet that’s the number that just stands there nakedly, without further clothing or explanation.

It almost goes without saying that the averages of small samples are more susceptible to what the authors call the luck of the draw. If two students out of eight are missing on evaluation day, does the final average tell us anything? And this goes along with the fact that anonymity is almost always lost in classes of fewer than 6-7 students, making the findings much less reliable. At the very least, we need a policy for collecting SETs in courses of 10 or fewer.
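For the numerically inclined, the luck of the draw is easy to see in a toy example. The sketch below uses invented ratings (not real data) for a hypothetical class of eight on a 7-point scale, then enumerates every possible pair of absentees:

```python
import statistics
from itertools import combinations

# Invented ratings for a hypothetical class of 8 on a 7-point scale.
ratings = [7, 6, 6, 5, 4, 3, 2, 2]
full_average = statistics.mean(ratings)

# If two students are absent on evaluation day, the reported average
# depends entirely on *which* two are missing: enumerate every
# possible set of six responders.
possible_averages = [statistics.mean(subset)
                     for subset in combinations(ratings, 6)]

print(f"full-class average: {full_average:.2f}")
print(f"reported average could fall anywhere from "
      f"{min(possible_averages):.2f} to {max(possible_averages):.2f}")
```

Depending on which two students happen to be missing, the same instructor’s “number” ranges from 3.67 to 5.17, a spread of a point and a half on a 7-point scale.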

The authors also point to some common statistical errors that are made when the one number we focus on is *the average*. “Academic personnel review processes often invite comparing an instructor’s average scores to the departmental average,” they write. “Such averages and comparisons make no sense, as a matter of statistics. They presume that the difference between 3/7 [a score of 3 on a 7 point scale] and 4/7 means the same thing as the difference between 6/7 and 7/7. They presume that the difference between 3/7 and 4/7 means the same thing to different students. They presume that 5/7 means the same thing to different students and to students in different courses. They presume that a 3/7 ‘balances’ a 7/7 to make two 5/7s. For teaching evaluations, there is no reason any of those things should be true.”

Now, I’ll leave it to my colleagues in statistics to fully evaluate the case the authors make, although it seems right to me on a basic level. Using the Oberlin 5-point scale, if an instructor received one “5” and one “1,” would we have learned that he was a “3”? Or, to put it a different way, what *would* we have learned? (I can’t resist repeating the authors’ statistics joke: Three statisticians go deer hunting. The first shoots and misses a yard to the left; the second shoots and misses a yard to the right; the third yells “We got it!” Is this what statisticians do when they get together???)

Anyway, there are more issues to consider. A lot of teaching simply isn’t commensurable in quantifiable terms. Teaching a “service-oriented” introductory course is not the same as teaching an upper-level seminar; teaching at 9:00 AM is not the same as teaching at 1:30 PM, etc. These are common issues. None of them are meant to suggest that we shouldn’t be doing our best in all of these courses or that there is absolutely no way to figure out what any single instructor is doing since we’re all different. We should and we can…but SETs are a relatively thin thread on which to hang the evaluation of teaching, something we hold to be massively important.

We have done a number of things to improve Student Evaluations of Teaching at Oberlin. Some years ago we put every department on the same 1-5 scale and made sure the “description” for each number is uniform across campus. We removed those questions that students aren’t competent to answer (e.g., does the instructor know the subject matter?), and have made the questions as straightforward and clear as possible so that when students answer them we know what question it is that they are answering (*validity*). And we designed our SETs based on research that has found which areas of inquiry are likely to produce *reliable* data (i.e., do different students – say, a first-year and a junior – give the same instructor similar marks? Would the same student give the same instructor the same mark later? In short, do students tend to agree?).

Still, the literature in this area is abundantly clear, and this new study out of Berkeley only makes the point more persuasively. SETs can tell you certain things. They can tell you: (1) about student engagement with the class and whether engagement has changed over time; (2) who are the outliers in different categories – who seems to be consistently at the top or toward the bottom of the ratings; and (3) some information from a careful reading of written student comments – but such comments are usually not comparable across classes.

What they don’t tell you is the effectiveness of the instructor as a teacher or whether the students are learning. (Grades and exam scores don’t do that either since we don’t know if exams are “hard” or “easy,” let alone the value added by the course to the students’ learning.)

So what *do* we know about evaluating teaching? We know that:

- SETs, as stated above, can provide valuable information about student engagement and student evaluation of their own learning … but only in a few areas, and it is more useful to read the comments and track changes over time than to generate a set of averages, let alone – please, no! – combine all the numbers into one big “average” for the instructor: He scored 3.8 in that History class.

- There is no such thing as a *perfect* measurement of teaching effectiveness, but that shouldn’t stop us from putting more effort into what can produce better information that can help us understand more about the kind of teaching that is going on in our classes.

- We need to be using multiple measures of teaching effectiveness, particularly when talking about high-stakes evaluations (reappointment, tenure, promotion in rank). These include:

- SETs, when used as suggested above, from which we can learn whether the instructor is engaging students in class in ways that are important, not just entertaining.

- Peer observations of teaching by *trained* observers, not just colleagues who know the subject matter. These observations need to be made uniform across departments and programs, including training for the observers, a pre-observation interview to get a better sense of what the instructor is intending to get at in the class that will be observed, and a post-observation interview to allow for further clarification of what went on. Here’s one example of this process from the teaching center at the University of Southern California.

- A “forensic” examination of course syllabi, usually by colleagues in similar fields outside of the college to determine whether the instructor’s teaching is keeping up with the field.

- Finally, and most importantly, I would argue, is a teaching portfolio which allows the instructor to discuss her approach to teaching and how it has developed over time. A teaching portfolio would include examples of teaching materials and samples of student work and would indicate how an instructor has used earlier critiques to refashion and rethink her teaching. Teaching portfolios are geared to helping the instructor and those reviewing the portfolio see the *dynamics* of teaching and how/if the teacher is looking for ways to incorporate valid critiques of teaching or new information and research about pedagogy into her courses. [Peter Seldin, *The Teaching Portfolio: A Practical Guide to Improved Performance and Promotion/Tenure Decisions*, 4th ed. (Jossey-Bass, 2010).]

Stark and Freishtat pose the following questions, which I think are good ones to ask. When you come right down to it, we need to have a system of teaching evaluation that can help us answer them.

Is this a good and dedicated teacher? Is she engaged in her teaching? Is she following pedagogical practices found to work in the discipline? Is she available to students? Is she putting in appropriate effort? Is she creating new materials, new courses, or new pedagogical approaches? Is she revising, refreshing, and reworking existing courses based on feedback and on-going investigation? Is she helping keep the curriculum in the department up to date? Is she trying to improve? Is she improving? Is she contributing to the college’s teaching mission in a serious way? Is she supervising undergraduates for research, internships, and honors theses? Is she advising and mentoring students? Do her students do well when they graduate?

Evaluations of teaching that can help us answer these questions would, indeed, be valuable.