Student Evaluation of Teaching (SET): Guidelines and Recommendations for Effective Practice

These guidelines and recommendations were developed from a review of the research conducted by a subcommittee of the Center for Excellence in Learning and Teaching. We encourage readers to review these materials in order to promote the effective use of Student Evaluation of Teaching (SET). We hope that these guidelines, combined with the suggested questions and resources, will offer useful strategies to enhance teaching and learning. SET is one of several mechanisms for improving teaching and learning and should not be the only data source for evaluating instruction.

The guidelines and recommendations are organized into the following categories (quick links):

  • Use
  • Instrumentation
  • Administration
  • Analysis and Reporting


Use

Effective uses of student ratings:

  • focus on accurate, timely, and usable measures of learning outcomes
  • are commonly understood and accepted by institutional stakeholders
  • serve to improve instruction as well as to evaluate faculty performance

Additional Information on Use Guidelines

  1. Student evaluation of teaching should be part of an overall strategy for improving student learning. Use SET along with other assessment methods (e.g., mid-semester feedback, peer observation, teaching portfolios) (Algozzine et al., 2004; Cannon, 2001; Cashin, 1999; Marincovich, 1999; Sproule, 2000).
  2. Faculty and administrators should develop a shared understanding of how student evaluation information is used and its purpose at the institution. This information can be used in various ways (e.g., provide information for improvement, provide information to evaluate the course, offer feedback to faculty, contribute to promotion and tenure decision-making) (Algozzine et al., 2004; Arreola, 2000; Marincovich, 1999; Theall & Franklin, 2001).
    • SET can be used to improve instruction.
      • Ratings on global items (e.g., “rate this course (or instructor) overall”) do not provide information specific enough to guide improvement in a course; specific information is needed to improve instruction (Cohen, 1983; McKeachie, 1997).
      • Grouping individual items by factors or dimensions (e.g., instructor enthusiasm, organization/clarity, breadth of coverage, group interaction) may be the best method of providing meaningful feedback to instructors (Algozzine et al., 2004; Marsh & Dunkin, 1992).
    • SET alone does not provide sufficient information for making employment decisions.
      • No single criterion of effectiveness is widely accepted (Abrami, 1993; Marsh, 1995).
      • There is continued debate over the merits of using overall teaching scores for personnel decisions as opposed to multidimensional profiles of teaching effectiveness (Harrison, Douglas, & Burdsal, 2004).
      • Personnel committees should use broad categories (e.g., promote/don’t promote, deserves merit increase/deserves average increase) rather than “attempting to interpret decimal-point differences” in making evaluations (McKeachie, 1997).
      • Some argue that teaching is multidimensional and that individual dimensions should be considered separately in evaluating teaching effectiveness. Problems with global evaluations prompted use of dimensions or subsets of student ratings (Marsh & Roche, 1997; Rice & Stewart, 2000).
      • There is concern that student, course, and instructor differences are ignored in reporting of “effectiveness” (Sproule, 2000).
  3. The use of SET should focus attention on improving teaching and learning outcomes, rather than simply improving perceptions of the instructor.
    • One study found that students believed improving teaching and course content was the most important outcome of evaluating teaching, ahead of staffing and future course planning (Chen & Hoshower, 2003).
    • The typical evaluation form does not rate student-centered or active approaches to learning such as collaborative learning (Centra, 1993).
    • Items should focus on faculty members’ effectiveness at creating an environment for learning and de-emphasize potentially superficial indicators of entertainment value or personal charisma (Shevlin, Banyard, Davies, & Griffiths, 2000). For example, ask students to “consider how they have been changed by their encounter with the course material, not how they have been entertained by [instructor] performance” (Hodges & Stanton, 2007).
    • Murray (2001) recommends that faculty either write a reflective essay that includes an interpretation of the results or fill out the same form and discuss differences compared to student responses.
  4. SET can provide accurate, timely, and usable results for reporting (Arreola, 2000).
    • Grouping items by factors (e.g., organization, clarity of communication) may be the best method of providing meaningful feedback to instructors (Algozzine et al., 2004).
    • Strive for quick processing and return of forms (Marincovich, 1999).
    • Because previous research highlights differences in mean scores by discipline, comparisons with a college mean should be interpreted cautiously (Murray, 2001).
    • The most common rating problems are misuse of rating data; bad ratings due to poor instrument construction, administration, or analysis; and misinterpretation (Franklin, 2001).
    • Include other data sources to make decisions about teaching effectiveness (Algozzine et al., 2004).
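The factor grouping recommended above (combining related items into dimensions such as organization or enthusiasm) can be sketched as a simple aggregation. This is a minimal illustration in Python; the item and factor names are hypothetical, not drawn from any published instrument.

```python
from statistics import mean

# Hypothetical mapping of rating items to factors. A real instrument would
# establish these groupings empirically (e.g., via factor analysis).
FACTORS = {
    "organization_clarity": ["q1_objectives_clear", "q2_well_organized"],
    "instructor_enthusiasm": ["q3_enthusiastic", "q4_dynamic"],
}

def factor_scores(responses):
    """Aggregate per-student item ratings (1-5) into factor-level class means.

    `responses` is a list of dicts mapping item name -> rating.
    Unanswered items (None) are excluded from the averages.
    """
    scores = {}
    for factor, items in FACTORS.items():
        ratings = [
            r[item]
            for r in responses
            for item in items
            if r.get(item) is not None
        ]
        scores[factor] = round(mean(ratings), 2) if ratings else None
    return scores

class_responses = [
    {"q1_objectives_clear": 5, "q2_well_organized": 4,
     "q3_enthusiastic": 4, "q4_dynamic": 5},
    {"q1_objectives_clear": 4, "q2_well_organized": 4,
     "q3_enthusiastic": 3, "q4_dynamic": None},
]
print(factor_scores(class_responses))
```

Reporting at the factor level, rather than item by item or as one global score, gives instructors a small number of interpretable dimensions to act on.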


Instrumentation

Effective student evaluation of teaching instruments are valid, reliable, and practical. They:

  • include open- and closed-ended questions
  • include intentional measures of both general instructor attributes (e.g., enthusiasm or effectiveness) and specific instructor behaviors (e.g., listening, providing feedback)
  • use consistent scales (e.g., five-point, same direction, 1 = low, 5 = high) and offer a no-opinion option
  • produce useful feedback to instructors that can inform their teaching
  • can be completed thoughtfully within 10 to 15 minutes

Additional Information on Instrumentation Guidelines

  1. Instruments need to measure what they purport to measure (that is, be valid) and measure consistently over time and among respondents (that is, be reliable) (Arreola, 2000; Franklin, 2001; Hobson & Talbot, 2001).
  2. Institution-developed instruments should clearly identify purpose, select specific aspects to be evaluated (e.g., course design), choose type of items (e.g., specific faculty behaviors, general overall effectiveness), and establish reliability and validity (Arreola, 2000).
  3. Instruments should include both open- and closed-ended questions (Arreola, 2000; Cashin, 1999).
    • Ask students to “consider how they have been changed by their encounter with the course material, not how they have been entertained by our [instructor] performance” (Hodges & Stanton, 2007).
    • Questions should provide a reflective opportunity for students and reveal something to the instructor about teaching in the discipline (Rando, 2001).
  4. Use low-inference items (individual behaviors) and high-inference items (global measures such as enthusiasm, clarity, or overall effectiveness) that align with purpose (Abrami, 1985; Cashin, 1999; Cashin & Downey, 1992; Marsh & Roche, 1997; Murray, 1983; Renaud & Murray, 2005).
    • Low-inference (specific individual behaviors)
      • A single score or overall course measure cannot represent the multidimensionality of teaching (Marsh, 1987).
      • Specific behavior items can be clear to students and offer easy interpretation for instructors (Cashin, 1999).
    • High-inference (overall global measures)
      • Low-inference ratings may be more affected by the systematic distortion hypothesis (that is, traits can be judged to be correlated when in reality they have little or no correlation) (Renaud & Murray, 2005).
      • Ratings of overall effectiveness are predictable from specific classroom behaviors of instructors. Observer reports of these behaviors (e.g., addressing individual students by name, placing outline of lecture on the board) correlate with student ratings of overall teaching and measures of student learning and motivation (Murray, 1983).
    • Combine related individual items into a factor or dimension (Cheung, 2000; Harrison, Douglas, & Burdsal, 2004; Rice & Stewart, 2000; Marsh & Dunkin, 1992; Neumann, 2000).
      • “Grouping items by factors may be the best method of providing meaningful feedback to instructors” (Algozzine et al., 2004, p. 135).
      • Covariation in student ratings of teacher effectiveness was closely related to observed differences in teaching behaviors (Renaud & Murray, 2005).
  5. Potential control variables may bias results.
    • Although many variables may influence student ratings, unless they moderate the correlation between student ratings and student learning, they cannot be described as biasing variables (d’Apollonia & Abrami, 1997).
    • Review of the research literature suggests minimal effect on evaluation results of several variables relating to course (size, subject matter, required vs. elective, time of day, difficulty level), instructor (e.g., personality, rank, research productivity, grading leniency, gender, race), and student (e.g., GPA, age) (Centra, 1993).
    • Research on the effect of specific factors (e.g., class size, course level, student workload, previous experiences with instructor, gender, and instructor characteristics) on evaluation results offers mixed results (Algozzine et al., 2004).
  6. Use a consistent point scale (Cashin, 1999), a consistent scale direction (e.g., 1 = low, 5 = high; Arreola, 2000), and a no-opinion option.
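The scale guideline above can be enforced mechanically when responses are tabulated. A minimal Python sketch, assuming a five-point scale with the no-opinion option coded as 0 (an illustrative coding, not a standard):

```python
VALID_RATINGS = {1, 2, 3, 4, 5}   # five-point scale, 1 = low, 5 = high
NO_OPINION = 0                    # assumed coding for the no-opinion option

def clean_item(raw_values):
    """Separate substantive ratings from no-opinion responses.

    Returns (ratings, no_opinion_count) and rejects any value outside
    the declared scale, so scale inconsistencies surface immediately.
    """
    ratings, no_opinion = [], 0
    for v in raw_values:
        if v == NO_OPINION:
            no_opinion += 1
        elif v in VALID_RATINGS:
            ratings.append(v)
        else:
            raise ValueError(f"rating {v!r} is outside the 1-5 scale")
    return ratings, no_opinion

ratings, skipped = clean_item([5, 4, 0, 3, 0])
print(ratings, skipped)
```

Keeping no-opinion responses out of the substantive ratings, while still counting them, prevents them from dragging item means toward the low end of the scale.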


Administration

Administration of SET is effective when:

  • Students know that their responses are important and will be used
  • Evaluations are administered the week prior to the last week of teaching
  • Someone other than the faculty member administers the evaluation
  • Students are assured of, and receive, confidentiality/anonymity
  • Administrative guidelines are clear and consistent

Additional Information on Administration Guidelines

  1. Share with students the significance of their participation and how the information will be used. One useful strategy is to include on the instrument some examples of how the data are used. Another is to demonstrate to students how feedback is used (e.g., share in the syllabus an example of how student evaluation helped to improve the course) (Chen & Hoshower, 2003; Svinicki, 2001).
  2. The best time to administer teaching evaluations is during the week prior to the last week of teaching (Cashin, 1999).
  3. The faculty member should not administer the evaluation. A colleague, staff member, or student should collect completed evaluations, seal them in an envelope, and return them to the department’s administrative staff (Cashin, 1999; Marsh & Roche, 1997).
  4. Remind students that faculty will not see evaluation results until after final grades are submitted (Cashin, 1999).

Analysis and Reporting

Effective Analysis/Reporting of SET avoids misuse and misinterpretation and involves:

  • An appropriate unit of analysis
  • Comparisons between similar instructional situations only
  • Attention to the quality of data: Is the sample size adequate and representative? Is there a high degree of variability in some items (i.e., a large standard deviation)?
  • Thoughtful analysis of written comments

Additional Information on Analysis and Reporting Guidelines

  1. Use the appropriate unit of analysis. Class average is usually appropriate, while use of individual student scores is rarely appropriate (Marsh & Roche, 1997).
  2. Examine the quality of the evaluation data through attention to the sample and to data variation. Determine whether the sample is representative of the population (e.g., adequate response rate, courses and respondents representative). In addition to mean scores, summary information should include descriptive data such as standard deviations to show the variability of the data (Arreola, 2000; Cashin, 1999).
  3. Analyze students’ written comments (Lewis, 2001; Marincovich, 1999).
  4. Avoid misuse and misinterpretation (Theall & Franklin, 2001). The most common rating problems are misuse of rating data; bad ratings due to poor instrument construction, administration, or analysis; and misinterpretation (Franklin, 2001).
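The data-quality checks above (response rate, class mean, and variability) can be sketched in Python. The 60% response-rate and 1.0 standard-deviation thresholds below are illustrative assumptions for flagging purposes, not values recommended in the literature.

```python
from statistics import mean, stdev

def summarize_item(ratings, enrolled,
                   min_response_rate=0.6, sd_flag=1.0):
    """Report class-level summary statistics for one rating item.

    `ratings` holds the substantive 1-5 responses for the class;
    `enrolled` is the number of students who could have responded.
    Flag thresholds are illustrative, not drawn from the literature.
    """
    n = len(ratings)
    rate = n / enrolled if enrolled else 0.0
    sd = stdev(ratings) if n > 1 else 0.0
    return {
        "n": n,
        "response_rate": round(rate, 2),
        "mean": round(mean(ratings), 2) if n else None,
        "sd": round(sd, 2),
        "low_response": rate < min_response_rate,  # sample may not be representative
        "high_variability": sd > sd_flag,          # students disagree; read the comments
    }

print(summarize_item([5, 4, 4, 3, 5, 2, 4], enrolled=10))
```

Reporting the standard deviation alongside the mean, and flagging low-response or high-variability items, supports the class-level (not individual-student) interpretation recommended above.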