February 25, 2016

The Promise of Using Evidence to Improve Education

Mark Dynarski

Advisor, Education

George W. Bush Institute

Data and evidence in education have come a long way since the Nineties. Today is the time to ensure that 15 years from now, we don’t wonder what happened with all that talk about using evidence to improve education.

As part of the Bush Institute’s accountability initiative, we are launching a series of articles and essays that will examine how states and schools can best use data to improve teaching and learning. The first piece in the series looks at some of the history of accountability. It also looks ahead at ways in which states can use their authority wisely under the new Every Student Succeeds Act.

A large part of their work will need to involve using evidence to guide decisions about classroom work as well as new accountability systems. Research evidence is a tool for identifying programs, policies and interventions that can help students. It also is helpful to recount how testing, accountability and evidence from research came to be in ESSA. The past offers us a glimpse of what we need to do for the future.

Using tests for accountability has been happening for a long time

Recent debates about reauthorizing the No Child Left Behind Act as the “Every Student Succeeds Act” may have left an impression that No Child Left Behind (NCLB) began the practice of using tests for accountability. In fact, the Improving America’s Schools Act (IASA), signed by President Clinton in 1994, required states to use tests for accountability.

IASA also required states to define “adequate yearly progress,” to disaggregate results by gender, race, English learners, economic status and other indicators, and to identify schools in need of improvement. That designation meant undertaking various actions such as reconstituting personnel, transferring students, or “alternative governance.” IASA did not call for states to use evidence to improve education, but in the Nineties, calling for evidence would have been pointless.

States were supposed to assemble their plans and submit them to the U.S. Department of Education. Here the story gets interesting and offers a lesson for the future.

The U.S. Commission on Civil Rights pointed out that on the last day of the Clinton Administration, six years after the law had been signed, the department had approved performance standards for only 28 states and had approved assessments for only 11 states. Some states did not disaggregate their data, 22 states did not include English learners in their assessments, and 14 states did not include students with disabilities. The commission described federal enforcement as “lax.”

NCLB was different. It used many of the concepts from IASA — adequate yearly progress, in need of improvement, disaggregation — but it provided a structure within which states designed their frameworks. States could choose their own assessments, for example, and their own growth targets. The frameworks needed to be approved at the federal level. And the U.S. Department of Education enforced the law, too. Ultimately many schools and districts were designated “in need of improvement.” (The media came to call these “failing” schools, but NCLB did not use that expression, and “in need of improvement” is broader and less judgmental than “failing.”)

The disaggregated data showed that students with disabilities and English learners often did not meet performance targets, which triggered the “in need of improvement” designation. In a sense, the fact that these groups were not meeting performance targets justified requiring that results be disaggregated in the first place.

Those groups always had been behind, but NCLB’s requirements for annual testing and reporting disaggregated results made them visible. Recall that under IASA, many states did not even include these students in their assessments.

Accountability improved performance

Ironically, the timing of the ESSA reauthorization coincided with the findings of different teams of researchers that NCLB had improved achievement. Earlier studies in the mid-2000s (here and here) showed that the “consequential” accountability that various states had put into place during the 1990s had improved achievement.

Researchers used the term “consequential” to distinguish it from another kind of accountability in which states simply posted test scores publicly, “report card” accountability. The studies showed that report-card accountability essentially did nothing to improve education. There had to be consequences.

More recent studies of NCLB exploited the fact that for some states the change to NCLB’s structure did not affect them much. Their accountability structures already resembled what NCLB required. For other states the differences were vast.

If NCLB improved achievement, the thinking went, the states that were most affected by NCLB’s accountability structure — those that had few consequences built into their own structures — would show upward tilts in their test scores. States not much affected by NCLB’s accountability structure would see their score trends more or less unaffected — for them little had changed.

And that’s what studies found (here, here, and here). The “push” from NCLB improved achievement. The improvements were not huge, but there is a big difference between not huge and zero. As Eric Hanushek of Stanford University pointed out, small improvements experienced by tens of millions of children quickly add up to eye-popping gains in economic product.
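The comparison behind these studies can be illustrated with a small sketch. It is a simplified difference-in-differences calculation, which is an assumption about how best to show the logic; the studies themselves may have used more elaborate designs. The function names and all of the scores are hypothetical and made up for illustration.

```python
from statistics import mean


def score_change(before, after):
    """Average score after the policy change minus average score before it."""
    return mean(after) - mean(before)


def difference_in_differences(high_before, high_after, low_before, low_after):
    """Change in heavily affected states minus change in lightly affected states.

    The lightly affected states stand in for what would have happened
    without the new accountability structure.
    """
    return score_change(high_before, high_after) - score_change(low_before, low_after)


# Hypothetical average test scores by year; NOT data from any study.
high_exposure_before = [230.0, 231.0, 232.0]  # states with few prior consequences
high_exposure_after = [236.0, 238.0, 240.0]
low_exposure_before = [235.0, 236.0, 237.0]   # states already resembling NCLB
low_exposure_after = [238.0, 239.0, 240.0]

estimate = difference_in_differences(
    high_exposure_before, high_exposure_after,
    low_exposure_before, low_exposure_after,
)
print(f"Estimated effect of the accountability push: {estimate:.1f} points")
```

In this made-up example both groups of states improve, but the heavily affected states improve by more. The gap between the two changes is what gets attributed to the policy.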

On the other side, some critics argued that for whatever gains it might have achieved, NCLB harmed education by encouraging schools to focus only on tests. They depicted an Orwellian world of lifeless classrooms in which students spent the day filling in bubbles on answer sheets. Yet, studies reported that annual tests took up less than two percent of a school year.

Still, the fiction persisted: Tests were draining life from classrooms.

Mostly these criticisms were intended to terrify parents into believing that pressure from tests was diverting teachers from doing activities or lessons that were more educationally potent. The criticism asked parents to believe that teachers and students would be doing really wonderful learning in classrooms if they were not so focused on tests. It sidestepped obvious questions. If these activities led to a lot of learning, wouldn’t teachers be doing them to improve test scores? Tests measure learning. Were teachers really not doing their best stuff because of tests? Do athletes not put out their best performances because games are scored?

And what should replace tests? How will parents know how their child is doing? Here the tack was that educators are to be trusted. Teachers are professionals and if a teacher tells a parent their child is doing “fine,” then that’s that.

Professional educators know children are learning, the argument went, and asking for test scores was a sign of distrust. If this were medicine, it would be as if doctors refused to explain to patients how they reached a diagnosis and treated any patient who wanted to know as someone who did not trust their doctor.

Perhaps the public once was willing to accept the word of professionals without requiring objective evidence. But the same thinking that requires publicly traded corporations to have external auditors applies in education.

Companies have reasons to present their financial results in favorable ways, but auditors follow standard protocols to scrutinize the accounts and present findings to the public. Districts and schools have reasons to present their outcomes in favorable ways, but tests follow standard protocols to measure student learning (hence the term “standardized tests”) and those measures can be reported to the public.

The public values objectivity. Taxpayers are spending $600 billion a year on K-12 public education and want to know what they are getting in exchange for their money. Test scores don’t measure everything about education. But they are objective. Scores can be compared between schools, between districts, even between states.

If scores are improving or declining, or if gaps between groups of students are growing wider or narrower, the public can see it. Does anyone think that schools scoring poorly on tests are doing a great job educating their students in all other dimensions of education? It is quite a leap of faith to think so.

Using research evidence will ensure that accountability continues to improve performance

Ultimately, NCLB’s target of 100 percent proficiency by 2014 was not reached. After acrimonious debates, its replacement, the Every Student Succeeds Act, devolved accountability back to the states in ways reminiscent of IASA.

Nobody argued that IASA had been so successful that it was the right model for ESSA. But NCLB’s requirement that all states fit their accountability frameworks within a general structure could not be sustained in a system of governance that gives states primary authority for their schools. States wanted their own structures.

ESSA maintained NCLB’s requirement for annual testing in grades 3 through 8 and once in high school, and it requires scores from those tests to be disaggregated. But the law did more. It called for states to use evidence to identify effective programs. NCLB had called for states and districts to use evidence, but in 2001, there wasn’t a lot of evidence available. In the 15 years since then, far more of it has accumulated.

The Institute of Education Sciences within the U.S. Department of Education has spearheaded initiatives that have both generated high-quality evidence and synthesized evidence for educators. Its “What Works Clearinghouse” has reviewed more than 11,000 studies and released nearly 600 syntheses of those reviews.

It also has convened panels of researchers and practitioners to develop practice guides on issues such as how to control classroom behavior, teach literacy to English learners, use data for decision-making, teach fractions effectively, teach writing to elementary school students, and more. Every practice recommended in a guide is backed by evidence of its effectiveness.

Almost by definition, rigorous evidence is objective evidence. It’s a natural complement to objective test scores. The “scientific method” — pose a hypothesis, test it with data, and report the findings in a way that allows others to replicate the test — is objectivity at work. Values, tastes, preferences, opinions, “subjective factors” — they have no role in the method.

ESSA calls specifically for states to use evidence in determining how to intervene in the lowest five percent of schools. Schools identified by states for improvement need to develop a plan that includes “evidence-based interventions” (see Title I, Section 1005 of the bill).

Several hundred pages later, the bill defines “evidence-based” (see pages 289-290). The definition specifies tiers — evidence from experiments is viewed as “strong,” evidence from quasi-experiments is viewed as “moderate,” and evidence from correlational studies is viewed as “promising.”

The tier structure reflects the acknowledged scientific view that evidence forms a hierarchy. Some kinds of evidence are more credible than others. The world of medical research has long embraced “clinical trials” (which are experiments) as its highest standard of evidence. Education now has followed suit.

The bill also creates an alternate route: states can identify programs whose rationale is based on research suggesting the program will improve student outcomes, provided the state examines the program’s effects. So either states use programs supported by existing evidence, or they use a program that is promising and they evaluate it. Either way, evidence is in play.
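As a rough illustration, the tiers and the alternate route described above can be written as a simple lookup. This is only a sketch of the structure as this article summarizes it, not the statutory definition. The function name, the study-design keys, and the label for the alternate route are hypothetical.

```python
from typing import Optional

# Tier labels follow the article's summary; the dictionary keys are
# hypothetical shorthand for the kinds of studies it mentions.
TIER_BY_STUDY_DESIGN = {
    "experiment": "strong",
    "quasi-experiment": "moderate",
    "correlational": "promising",
}


def evidence_tier(study_design: Optional[str],
                  has_rationale_and_evaluation: bool = False) -> Optional[str]:
    """Return the evidence tier for a program, or None if it does not qualify.

    A program with no qualifying study can still pass through the alternate
    route if it has a research-based rationale and the state evaluates it.
    """
    tier = TIER_BY_STUDY_DESIGN.get(study_design)
    if tier is not None:
        return tier
    if has_rationale_and_evaluation:
        return "rationale with evaluation"  # hypothetical label for the alternate route
    return None


print(evidence_tier("experiment"))
print(evidence_tier(None, has_rationale_and_evaluation=True))
print(evidence_tier("anecdote"))
```

A state reviewing candidate programs could apply a check like this to each one, keeping only those that return a tier.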

States are now working on their accountability frameworks. They need to identify programs and interventions to help their lowest performing schools.

Those programs and interventions need to be supported by evidence or by rationales drawn from evidence and coupled with evaluations. Even if the 50 states pursue 50 visions of accountability, the requirements to test and to use research evidence for improvement ensure objectivity and transparency. The public can know whether students are learning and can know the basis on which states are working to improve their lowest performing schools.

There is a lot of work to be done to review evidence and decide which programs and interventions pass muster. States can undertake the effort on their own, but capacity at state levels is a question.

The Institute of Education Sciences’ regional lab network can tackle setting up these tiers for states. The network covers all states and territories. The mission of the labs is to disseminate strong evidence to educators — here’s an ideal task to suit that mission.

Let’s not forget lessons from IASA in the 1990s. States have declared that they are ready and eager to take the lead in structuring accountability under ESSA. Trust, but verify.

The U.S. Department of Education needs to ensure that states are using the evidence guidelines in helping their lowest performing schools. Not just any evidence can be cited as a basis for doing something in these schools.

Future posts in this series will explore other topics related to using evidence and data in education.

What does the evidence say about using tests to improve instruction?

How can data be used to support decision-making? (Hint: analysis of data supports decision making; data are just numbers.)

What does research say about helping English learners and students in special education, the two types of students that most often triggered the “in need of improvement” status under No Child Left Behind?

Data and evidence in education have come a long way since the Nineties. Accountability is not going away, and it now includes using research to support improvements. Today is the time to ensure that 15 years from now, we don’t wonder what happened with all that talk about using evidence to improve education.