August 29, 2018

When Done Right, Standardized Tests Reveal a Student's Knowledge

Mark Dynarski

Advisor, Education

George W. Bush Institute

Parents may be curious, even skeptical, about the validity of the standardized exam their child took. But Bush Institute Education Reform Fellow Mark Dynarski explains how a rigorous process goes into creating those exams, and why they reveal a student's knowledge.

Not many people look forward to taking tests. But when we look on our doctor’s wall and see a plaque indicating that she is “board certified,” we might think the certification is a good thing. The doctor passed a test. Or a lawyer might be “admitted to the bar,” which is just a phrase meaning he’s passed the state’s bar exam, a test. We may not like taking tests, but we are happy that people helping us passed the ones they took. Nobody wants a doctor or lawyer whose knowledge is below the accepted standard.

This dynamic is not at work with parents and students. When our kids take state or district tests in schools, there often are complaints that tests are too narrow and do not capture all the dimensions that parents want in their child’s education. Or you hear that parents already know their child’s abilities and don’t need a test to tell them.

It is useful to take a step back and ask an important question — does the test score reflect what a child knows? If the score does reflect a child’s knowledge, the test is not flawed. It’s doing what it is designed to do. We can argue whether and how other education objectives should be measured, and the democratic nature of public education suggests people should be debating other objectives. But knowing whether test scores reflect knowledge seems to be a first step before debating other aspects of testing.

How tests are created

Standardized tests do not create themselves, of course, and parents who are not themselves educators might be curious, or even skeptical, about how tests are created. The process might seem like a big black box. In fact, it’s a rigorous and highly scientific process, one that has been developed over a hundred years. The process reflects research contributions of generations of some of the world’s most esteemed practitioners of probability and statistics. It has its own subfield, psychometrics, and every year universities graduate new PhDs in that subfield.

We can think about test development on a large scale by first thinking about test development in miniature. Think about the process a high school teacher goes through to design a test related to, say, the lessons he or she just taught on linear equations in algebra.

The teacher delivered a certain amount of material in the form of general class lectures, homework, and other kinds of assignments like group activities. What the teacher taught is related to a set of “content standards” that each state has developed and usually has posted online.

The Common Core was one such set of standards that many states adopted wholly or in part. Debates about the Common Core generated a lot of heat, but anyone reading Common Core standards must have wondered what the debates were about because the standards themselves are astonishingly boring.

Here’s an algebra standard from the Common Core: Solve linear equations and inequalities in one variable, including equations with coefficients represented by letters. Interested readers can find the other algebra standards, and the full set of standards, online.

To a teacher, this standard indicates that students should be able to do this: for the equation 3x + 4 = 13, x equals 3. Or, if the equation is ax + b = c, then x equals (c – b)/a.

In this second equation, coefficients are represented as letters.

Back to the test. The teacher wants to know if her students meet the standard. She thinks a reasonable question is to ask students to solve 15y + 10 = 40. The test question includes a subtle difference — students need to solve for “y” rather than “x”—but it’s still an equation in a single variable and students should see that. The teacher might up the difficulty level a bit, by asking students to solve y + 3y + 10 = 50. Students need to add the two “y” terms, but it’s still an equation in a single variable.
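The algebra in these examples can be checked with a short sketch. This is only an illustration, not part of any standard or test: the function name solve_linear is made up for the example, and it simply applies the formula x = (c – b)/a given above.

```python
from fractions import Fraction


def solve_linear(a, b, c):
    """Solve a*x + b = c for x, returning an exact fraction.

    Hypothetical helper for illustration; it applies x = (c - b) / a.
    """
    if a == 0:
        raise ValueError("the coefficient a must be nonzero")
    return Fraction(c - b, a)


# 3x + 4 = 13
print(solve_linear(3, 4, 13))

# 15y + 10 = 40 (the variable's name does not change the method)
print(solve_linear(15, 10, 40))

# y + 3y + 10 = 50: first combine the two y terms into 4y + 10 = 50
print(solve_linear(1 + 3, 10, 50))
```

The harder question adds only one step, combining like terms, before the same formula applies.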

The floor and ceiling effects

Another concept is useful here. If the teacher makes the test so easy that many students get all the questions right, she will have created what test designers call a “ceiling effect.” The teacher cannot distinguish which students have a high level of knowledge from those with a low level of knowledge — test scores all being 100 percent means students look the same. Their true abilities are somewhere above their score, which forms a “ceiling.”

The same concept applies in the other direction. If the test is so hard that few students get any questions right, a “floor effect” arises. If many students get a score of zero, a teacher cannot distinguish between knowledge levels of her students. Their true abilities lie below the floor.

A test with no ceiling or floor effects needs to have questions with higher and lower degrees of difficulty. Students with strong knowledge are able to answer the harder questions and students with weaker knowledge are not. The teacher might then give students with strong knowledge an “A” grade and students with weaker knowledge a “C.”
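A toy sketch can make the ceiling and floor effects concrete. Everything in it is assumed for illustration: the ability and difficulty numbers are invented, and the rule that a student answers a question correctly exactly when ability is at least the difficulty is a deliberate oversimplification, not how real tests are scored.

```python
def percent_correct(ability, difficulties):
    """Percent of questions a student answers correctly.

    Simplifying assumption: a question is answered correctly exactly
    when the student's ability is at least the question's difficulty.
    """
    correct = sum(1 for d in difficulties if ability >= d)
    return 100 * correct // len(difficulties)


def scores_on(test, abilities):
    """Scores for every student on one test."""
    return [percent_correct(a, test) for a in abilities]


# Hypothetical students with weak, middling, and strong knowledge.
abilities = [3, 5, 7]

too_easy = [1, 1, 2, 2]   # every question is below every student
too_hard = [8, 9, 9, 10]  # every question is above every student
mixed = [2, 4, 6, 8]      # questions spread across difficulty levels

print("too easy:", scores_on(too_easy, abilities))  # ceiling effect
print("too hard:", scores_on(too_hard, abilities))  # floor effect
print("mixed:   ", scores_on(mixed, abilities))     # students separated
```

On the easy and hard tests all three students receive identical scores, so the scores say nothing about who knows more. Only the mixed test separates them.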

On a much larger scale, this process is used to create assessments such as the PARCC test (the acronym stands for the Partnership for Assessment of Readiness for College and Careers), Smarter Balanced, and the Texas STAAR test, which stands for the State of Texas Assessment of Academic Readiness. Each of these tests is designed by teams of educators and testing experts working in concert.

Designing the tests starts with the standards, and educators and testing experts develop banks of questions related to those standards, such as the algebra ones above. Questions in the question bank are scrutinized to ensure they test what the standards call for, and that their wording is clear and not inappropriate, or “biased,” against any race or gender.

For example, a math question that involves calculating a baseball player’s batting average might pose a problem because some students do not play baseball or are otherwise unfamiliar with its rules. The solution might be to revise the wording so the question asks for a simple average (without referencing baseball), or simply to substitute another question.

Each question on the PARCC tests was reviewed by 30 or more people before it was used. Questions that made the cut were then pilot-tested in 14 states and nearly 16,000 schools. Smarter Balanced followed a similar process, testing more than 5,000 items in 21 states and more than 5,000 schools.

At any grade level, the test is likely to contain at least a few really hard questions, ones that may seem well beyond the abilities of students in that grade. Nobody expects fifth-graders to comprehend a reading passage from Hamlet, for example. Students and teachers tend to remember these kinds of questions, but the questions are not on the test simply to create pain and discomfort.

Seeding the test with difficult questions avoids the ceiling effect, and test scores are better able to discriminate between students who have a basic level of proficiency and students who have an advanced level of proficiency. Some fifth-graders will comprehend that Hamlet passage. And there might still be students who get all the questions wrong or right, but that is unlikely to happen.

Some complaints are deserved

Some aspects of tests deservedly draw complaints. For example, test-score reports to parents often are laden with statistical jargon such as norms, percentiles, normal curve equivalents, stanines, lexiles, and proficiency levels that are based on…who knows.

A parent whose child scores at the 65th percentile in fourth grade and the 65th percentile in fifth grade might wonder whether their child is standing still. But the child isn’t. In fact, the child has learned a year’s worth of material.

Their child scored better than 65 percent of fourth graders, and a year later their child scored better than 65 percent of fifth graders, which means their child had to have learned fifth-grade material. But the point is that test designers have done themselves no favors by providing reports that require parents to grapple with statistical concepts to make sense of scores.
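The percentile arithmetic can be sketched in a few lines. The scores below are invented for illustration, and the sketch assumes a single score scale that spans both grades, which is a simplification. The function percentile_rank is a hypothetical helper that counts the share of peers scoring below the child.

```python
def percentile_rank(score, peer_scores):
    """Percent of peers who scored below the given score."""
    below = sum(1 for s in peer_scores if s < score)
    return 100 * below / len(peer_scores)


# Hypothetical scores for 20 peers in each grade, on an assumed common scale.
fourth_grade_peers = list(range(400, 500, 5))  # 400, 405, ..., 495
fifth_grade_peers = list(range(450, 550, 5))   # 450, 455, ..., 545

# Hypothetical scores for the same child in consecutive years.
child_fourth = 465
child_fifth = 515

print(percentile_rank(child_fourth, fourth_grade_peers))
print(percentile_rank(child_fifth, fifth_grade_peers))
print(child_fifth - child_fourth)  # growth on the assumed scale
```

In this made-up example the child's percentile is the same both years, yet the child's score rose by as much as the whole comparison group's did. Holding a percentile means keeping pace with peers who are themselves learning.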

Some parents might see their child’s test score and think it must be wrong because they know their child is better at math than that. (Or reading, or science.) But we rarely hear from parents whose children did much better on the test than they expected, so there is an element of self-selection going on. And their child may have had an off day—illness, family distraction, not having eaten breakfast. Parents should view scores against the backdrop of other indicators of how their child is doing in school such as grades on report cards.

Some debates about tests are built on a flimsy basis. For example, annual state tests do not take up a lot of instructional time. Parents may be concerned that their child’s teacher is “teaching to the test,” but, as described above, tests are designed to measure knowledge about the same standards on which classroom curricula are based.

In that sense, teachers should be “teaching to the test.” Would anyone complain if law-school professors “teach to the bar exam”?

Parents might wonder if they should opt their child out of tests. As an individual act, opting out of tests is like opting out of annual medical checkups — it yields no information and does not make one healthier. As a collective act, opting out erodes what can be learned from scores. If parents of high-performing students in a school all opt out, that school’s average score will be lower (and vice versa for low-performing students). Who is being helped is unclear.
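The effect of collective opt-outs on a school's average is simple arithmetic, sketched here with invented scores. The cutoffs used to label students as high-performing or low-performing are arbitrary assumptions for the example.

```python
from statistics import mean

# Hypothetical scores for six students in one school.
all_scores = [50, 60, 70, 80, 90, 100]

# Assumed cutoffs, chosen only for illustration.
without_high = [s for s in all_scores if s < 90]  # high performers opt out
without_low = [s for s in all_scores if s > 60]   # low performers opt out

print("everyone tested:        ", mean(all_scores))
print("high performers opt out:", mean(without_high))
print("low performers opt out: ", mean(without_low))
```

In both opt-out cases the school's reported average moves even though no student knows any more or less than before, so the average becomes a less accurate picture of the school.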

Other debates, such as using student test scores to rate teacher performance, are driven by substantive issues. Few parents want their child’s teacher so preoccupied by ratings that they value nothing but scores. But most approaches for rating teachers place at most a moderate weight on student test scores.

State tests are products of extensive scientific effort and skill over decades. What they are testing reflects what states want their students to learn, the standards. Comparing average scores between schools and districts is possible only because students take the same test. In measuring what students know, tests are a tremendous asset, providing important and reliable information that cannot be learned in other ways.
