The Virtual Growlery: Robo-grading

"Essay Grading Software Offers Professors a Break," read the headline in the New York Times. With the usual fanfare according some new and seemingly innovative educational development, the article described the new system developed by edX, a nonprofit corporation set up and sponsored by a group of élite colleges and universities, and which was (until now) best-known for developing software for, and hosting, MOOCs. The president of edX, Dr. Anant Agarwal, is an electrical engineer (of course!) and has overseen this new system with the goal of 'instant feedback' in mind. After all, who wants to wait for a professor to diligently grade an essay by hand, a process which can take -- if you have 25-30 students in a class, as I often do -- a week at least, and sometimes more -- especially when you can get it "graded"instantly online, and instantly work to "improve" it. All of which raises two questions: 1) What is it that professors would be "freed" to do if they didn't do one of the most essential tasks of teaching? -- and 2) How can a computer possibly give meaningful feedback on an essay?

But it's not the first time. Years ago, there was a little test which would, like magic, determine the "readability" and grade level of an essay; it was called the Flesch-Kincaid Grade Level test. It was based on two things that roughly -- very roughly -- correlated with readability and sophistication: the length of sentences (surely longer ones were more challenging to read) and the length of words (given that many technical and specialized words of more than two syllables are derived from Latin or Greek, this again offered a sort of metric). Of course one could write none but brief words -- low grade level! Or someone preparing to compose a disquisition upon some substantive matter of polysyllabic significance could readily simulate the species of composition that would elicit a higher plateau of evaluative achievement. Not surprisingly, the test has military roots: the grade-level formula was developed for the US Navy in the 1970s, building on a readability formula Rudolf Flesch published in 1948, but it took on a new life when Microsoft Word included its metrics as an add-on to its built-in spelling and grammar checker. By its metrics, the lowest-rated texts would be those packed with monosyllables, such as Dr. Seuss's Green Eggs and Ham, while a long-winded theological or legal treatise loaded with multi-syllable words would score high.
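For the curious, here is roughly how little is going on under the hood. This is a minimal sketch in Python using the published Flesch-Kincaid Grade Level formula; the syllable counter is my own crude assumption (it just counts runs of vowels, so it overcounts words with a silent "e"), and the function names are invented for illustration, not taken from Word or anyone else's software.

```python
import re

def count_syllables(word):
    """Crude assumption: one syllable per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Grade level = 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("need at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "I do not like green eggs and ham. I do not like them, Sam I am."
ornate = ("Someone preparing to compose a disquisition upon some substantive "
          "matter of polysyllabic significance could readily simulate the "
          "species of composition that would elicit a higher plateau of "
          "evaluative achievement.")

print("simple:", round(flesch_kincaid_grade(simple), 1))
print("ornate:", round(flesch_kincaid_grade(ornate), 1))
```

Nothing in there knows whether either passage makes any sense; it only counts sentences, words, and syllables.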

So how does edX's system work? Not surprisingly, it takes a page from the neural network systems developed years ago at MIT to handle complex tasks like parsing language or manipulating a robot hand. The idea of a neural network is that it programs itself by repeating a task over and over, getting feedback as to the success of each result. When the feedback is good, the network "remembers" that behavior, and prioritizes whatever it did to achieve it; when feedback is bad, routines are dropped, or changed and tried again. It's not unlike the way babies learn simple motor skills.

And so, in order to judge whether an essay is "good," the edX system asks for 100 essays, essays already graded by professors. It internalizes all the patterns it can find in the essays marked as "good" or "bad," and then tests itself by applying these to additional papers; if its results match those of the human graders, it regards that outcome as "good" and works to replicate it. Of course, such a system can only possibly be as good as whatever the professors who use it think is good; it might well be that what is good at Ho-Ho-Kus Community College is not so good at Florida A&T or Cal Poly. And different assignments might call for different metrics, which might even vary over time; such a machine would need regular re-calibration.
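To make the train-then-check loop concrete, here is a toy sketch in Python. It is emphatically not edX's system, whose internals I haven't seen: the two "features" (stand-ins for things like normalized sentence length and word length), the handful of made-up graded examples, and the simple linear model nudged by its own errors are all assumptions chosen only to illustrate the idea of learning from human grades and then checking agreement on essays held back from training.

```python
def predict(model, features):
    """Predicted grade: a bias plus a weighted sum of the essay's features."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, features))

def mean_abs_error(model, examples):
    """Average gap between the machine's grade and the human's grade."""
    return sum(abs(predict(model, f) - g) for f, g in examples) / len(examples)

def train(examples, epochs=500, learning_rate=0.1):
    """Repeat the task over and over, adjusting the weights after each error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, human_grade in examples:
            error = predict((weights, bias), features) - human_grade
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias

# Made-up toy data: (features, grade a human assigned on a 0-4 scale).
graded = [((0.2, 0.3), 1.0), ((0.4, 0.5), 2.0), ((0.6, 0.6), 3.0),
          ((0.8, 0.9), 4.0), ((0.5, 0.4), 2.0), ((0.3, 0.2), 1.0)]
# Essays the model never saw, used only to check agreement with the humans.
held_out = [((0.7, 0.7), 3.0), ((0.1, 0.2), 1.0)]

model = train(graded)
print("error on training essays:", round(mean_abs_error(model, graded), 2))
print("error on held-out essays:", round(mean_abs_error(model, held_out), 2))
```

Notice that the only thing the loop ever optimizes is its distance from the human grades it was fed; change the graders, or the assignment, and the whole thing has to be trained again.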

But can such a computer program be said to really be evaluating these essays? No. It only works to be predictive of the grade that a human grader would tend to assign. And, with so-called "outliers" -- papers that are unusually good, or unusually bad -- its rate of error could be quite high. If we imagine a paper which breaks the usual rules of writing in order to obtain a certain effect, such a paper might indeed get very high marks from a human grader, but be flunked by a machine which believes there is no such thing as a good reason to violate a grammatical or structural rule.

So we're back to square one. If there were a large lecture where a standard sort of essay was expected, with very strict parameters, a program like this might be effective at matching its assessments to those of human evaluators. But this isn't how any college or university in fact teaches writing; in the best programs, the classes are small, the assignments are varied and often have elements of creative writing, and the level of individual attention -- and variation -- is high. Replacing professors in these classes with robo-graders would almost certainly result in much poorer learning.

And what are we freeing these professors up to do? What higher calling awaits those "freed" from grading essays? Recent surveys show that the average tenured or tenure-line professor in higher education today is teaching fewer classes than ever before; the figure was once over three courses per semester, and is now falling closer to two. Of course, at some smaller liberal-arts colleges, such as the one I teach at, our standard load is three; I myself teach four a semester, as well as four every summer -- twelve courses a year in all (hey, I've got kids in college myself, and I need the "extra" pay!). And somehow despite all that grading I've managed to write three books and dozens of articles. Meanwhile, at the big research universities, some professors get so much course relief that they teach as few as two courses a year -- over my career, I'll have taught more courses than six such professors. So I don't think the argument holds that professors need to be "freed up" from anything, unless they're teaching multiple large lectures, in which case they doubtless have teaching assistants anyway.

So go ahead, robo-grade your papers. Give your lectures to a video camera and have everyone just watch your MOOC. At that point, you don't really need to be on campus anyway, so why not just take an extended vacation? But if the parents who are laboring and borrowing to gather up the funds to pay the tuition that pays your salary start to feel that your vacation should be a permanent one, don't be surprised.


Friday, April 5, 2013
