The Algorithm Didn’t Like My Essay
By RANDALL STROSS
AS a professor and a parent, I have long dreamed of finding a software program that helps every student learn to write well. It would serve as a kind of tireless instructor, flagging grammatical, punctuation or word-use problems, but also showing the way to greater concision and clarity.
Now, unexpectedly, the desire to make the grading of tests less labor-intensive may be moving my dream closer to reality.
The standardized tests administered by the states at the end of the school year typically have an essay-writing component, requiring the hiring of humans to grade them one by one. This spring, the William and Flora Hewlett Foundation sponsored a competition to see how well algorithms submitted by professional data scientists and amateur statistics wizards could predict the scores assigned by human graders. The winners were announced last month, and the predictive algorithms were eerily accurate.
The competition was hosted by Kaggle, a Web site that runs predictive-modeling contests for client organizations — thus giving them the benefit of a global crowd of data scientists working on their behalf. The site says it “has never failed to outperform a pre-existing accuracy benchmark, and to do so resoundingly.”
Kaggle’s tagline is “We’re making data science a sport.” Some of its clients offer sizable prizes in exchange for the intellectual property used in the winning models. For example, the Heritage Health Prize (“Identify patients who will be admitted to a hospital within the next year, using historical claims data”) will bestow $3 million on the team that develops the best algorithm.
The essay-scoring competition that just concluded offered a mere $60,000 as a first prize, but it drew 159 teams. At the same time, the Hewlett Foundation sponsored a study of automated essay-scoring engines now offered by commercial vendors. The researchers found that these produced scores effectively identical to those of human graders.
Barbara Chow, education program director at the Hewlett Foundation, says: “We had heard the claim that the machine algorithms are as good as human graders, but we wanted to create a neutral and fair platform to assess the various claims of the vendors. It turns out the claims are not hype.”
If the thought of an algorithm replacing a human causes queasiness, consider this: In states’ standardized tests, each essay is typically scored by two human graders; machine scoring replaces only one of the two. And humans are not necessarily ideal graders: they provide an average of only three minutes of attention per essay, Ms. Chow says.
We are talking here about providing a very rough kind of measurement, the assignment of a single summary score on, say, a seventh grader’s essay, not commentary on the use of metaphor in a college senior’s creative writing seminar.
Software sharply lowers the cost of scoring those essays — a matter of great importance because states have begun to toss essay evaluation to the wayside.
“A few years back, almost all states evaluated writing at multiple grade levels, requiring students to actually write,” says Mark D. Shermis, dean of the college of education at the University of Akron in Ohio. “But a few, citing cost considerations, have either switched back to multiple-choice format to evaluate or have dropped writing evaluation altogether.”
As statistical models for automated essay scoring are refined, Professor Shermis says, the current $2 or $3 cost of grading each one with humans could be virtually eliminated, at least theoretically.
As essay-scoring software becomes more sophisticated, it could be put to classroom use for any type of writing assignment throughout the school year, not just in an end-of-year assessment. Instead of the teacher filling the essay with the markings that flag problems, the software could do so. The software could also effortlessly supply full explanations and practice exercises that address the problems — and grade those, too.
Tom Vander Ark, chief executive of OpenEd Solutions, a consulting firm that is working with the Hewlett Foundation, says the cost of commercial essay-grading software is now $10 to $20 a student per year. But as the technology improves and the costs drop, he expects that it will be incorporated into the word processing software that all students use.
“Providing students with instant feedback about grammar, punctuation, word choice and sentence structure will lead to more writing assignments,” Mr. Vander Ark says, “and allow teachers to focus on higher-order skills.”
Teachers would still judge the content of the essays. That’s crucial, because it’s been shown that students can game software by feeding in essays filled with factual nonsense that a human would notice instantly but software could not.
When sophisticated essay-evaluation software is built into word processing software, Mr. Vander Ark predicts “an order-of-magnitude increase in the amount of writing across the curriculum.”
I SPOKE with Jason Tigg, a London-based member of the team that won the essay-grading competition at Kaggle. As a professional stock trader who uses very large sets of price data, Mr. Tigg says that “big data is what I do at work.” But the essay-scoring software that he and his teammates developed uses relatively small data sets and ordinary PCs — so the additional infrastructure cost for schools could be nil.
Student laptops don’t yet have the tireless virtual writing instructor installed. But improved statistical modeling brings that happy day closer.
Randall Stross is an author based in Silicon Valley and a professor of business at San Jose State University.