George W. Bush Institute
The Randomized Controlled Trial In Program Evaluations: A Gold Rush Too Rushed?

March 12, 2013 by Catherine Freeman

This guest blog was written by Steven M. Ross, PhD, with the Center for Research and Reform in Education at Johns Hopkins University.

Before answering the above question, I should first clarify what it is asking. Randomized Controlled Trials (RCTs) are considered by many researchers, policy makers, and notably, by the U.S. Department of Education's What Works Clearinghouse, to be the "gold standard" of program evaluation. By randomly assigning participants to the intervention group (i.e., those exposed to the new educational program) and the control ("business-as-usual") group, RCTs help to ensure equivalence in the two samples. The participants may be schools, teachers, classes, or students in selected schools. Importantly, if the intervention group outperforms the control group, we can have much greater confidence that the program of interest (not superior students or teachers) caused the effect.
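The logic of random assignment described above can be sketched in a few lines of code. This is a minimal illustration, not part of the original post: the school names and post-test scores are entirely hypothetical, and a real RCT analysis would involve power calculations and significance testing rather than a simple difference in means.

```python
import random
import statistics

def randomly_assign(units, seed=0):
    """Randomly split a list of units (schools, classes, students)
    into an intervention group and a control group of equal size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = units[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical example: eight schools agree to be randomized.
schools = ["School A", "School B", "School C", "School D",
           "School E", "School F", "School G", "School H"]
intervention, control = randomly_assign(schools)

# With hypothetical post-test scores for each school, the quantity of
# interest is the difference between the two group means; because the
# groups were formed at random, they should be equivalent on average
# before the program starts.
scores = {s: 70 + random.Random(s).random() * 10 for s in schools}
effect = (statistics.mean(scores[s] for s in intervention)
          - statistics.mean(scores[s] for s in control))
print(f"Estimated effect: {effect:.2f} points")
```

The coin-flip assignment is exactly what makes recruitment hard in practice: each school must be willing to accept either status before knowing which it will receive.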

In my work at The Center for Research and Reform in Education (CRRE) at Johns Hopkins University, I frequently plan evaluation studies with program developers and educational leaders responsible for selecting programs. Their introductory comments often allude to the desire for nothing less than an RCT—the gold standard! But, in the majority of circumstances, such "prospecting" seems highly ill-advised. Here are some thoughts:

  • The development of effective educational programs takes time.
  • A high-fidelity implementation of new programs by schools takes time.
  • Rushing into an RCT to judge the effects of a developing program or early implementation could well show outcomes that are worse, not better, than those of an established control program.
  • RCTs are relatively expensive and demanding to conduct because they need samples large enough to compare, require recruiting participants willing to cast their fate (program vs. control status) to a coin flip, and carry other logistical burdens.
  • Randomization, in itself, creates some artificiality in conditions compared to those in real life. There is even research evidence that control schools sometimes try harder to succeed as a result of not being selected (the "John Henry Effect"). Or, in other situations, the program developers and the "chosen" practitioners may produce positive results ("Fool's Gold?") by making extraordinary efforts to demonstrate success with the new program.


To help determine the level of evaluation best suited to the developmental status of programs, we informally employ the following hierarchy with our clients:

  1. Design and Implementation Quality: Smaller "formative evaluation" studies, often using observations, interviews, and surveys, that focus on a program's design components and how they are received and used by target consumers (e.g., teachers, students, parents). Program improvements are directly informed by results.
  2. Efficacy: Medium-scale studies that focus on how the program operates and affects educational outcomes in try-outs in pilot schools or in small treatment-control group comparisons.
  3. Effectiveness: Larger-scale "summative evaluation" studies that focus on the success of the program in improving outcomes in rigorous non-randomized ("quasi") experimental studies or randomized controlled trials.


Aspiring to RCTs is certainly a worthy future goal for program developers and consumers. But all that glitters isn't gold: the real payoff in evaluation comes not from rushing to achieve the gold standard, but from employing the particular approach (design, efficacy, or effectiveness) that best fits the current status and reach of a program. For program evaluators, developers, and users, that should be a Golden Rule.