After reading through the full PDF, I’m unconvinced that the argument it makes is a good one.
The context of this is in automated grading of student code projects. If you went through a CS degree, you know the kind: you’re given a function to implement, and your function is graded against a test suite. The instructors have a reference implementation that was presumably used to develop that test suite, and whose outputs are taken as the ground truth reference output.
Typically, what instructors do is try to make really thorough test suites based on the assignment spec, and grade students based on whether or not they pass. This is what the authors call the “axiomatic” method, and it’s the standard for most automated testing in schools. They seem to argue that it’s too kind to the students and that it lets some students pass when they shouldn’t, a claim I’m already skeptical of. (If you want stricter grading, write more thorough and stricter tests.)
So they say, “Hey, what if we asked the students to make their own test cases, and then added our students’ test cases to our test set? To see whether the tests are viable, we’ll just compare them to our reference implementation, and if the students’ expected test output agrees with our reference implementation, we’ll grade based on that too!” They called this “AlgSing”.
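As I understand it, the vetting step is simple. Here’s a minimal sketch of that idea; the function names are mine, not the paper’s, and `sorted` stands in for whatever the instructor’s reference implementation actually is:

```python
# Hypothetical sketch of AlgSing's vetting step: a student-supplied test case
# is added to the grading suite only if the instructor's single reference
# implementation reproduces the student's expected output.

def vet_student_test(test_input, expected, reference):
    """Accept the student's test iff the reference implementation agrees."""
    return reference(test_input) == expected

reference = sorted  # stand-in for the instructor's reference implementation

assert vet_student_test([2, 1], [1, 2], reference)      # accepted
assert not vet_student_test([2, 1], [2, 1], reference)  # rejected: wrong expectation
```

Anything the lone reference implementation happens to do becomes gradable truth, which is exactly where the trouble starts.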
The problem, and they acknowledge this in the paper, is that the assignment spec might not be 100% complete, and when it comes to an ambiguous case, just because your reference implementation does something doesn’t mean it’s the only possible correct implementation. For example, if you’re creating a binary search tree from a list of numbers, when you find a duplicate number you can always put it in either the left or the right subtree, and it makes no sense to penalize students for making a different arbitrary choice than you. If you blithely incorporate student tests into your test set, you’ll end up being way overzealous in failing student implementations.
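To make the BST example concrete, here’s a small sketch (all names are mine, illustrative only) of two builders that differ only in the spec-silent choice of where duplicates go. Both produce valid BSTs with identical inorder traversals, but different shapes:

```python
# Hypothetical sketch: two BST builders that differ only in an arbitrary,
# spec-silent choice -- which subtree receives duplicate keys.

class Node:
    def __init__(self, val):
        self.val, self.left, self.right = val, None, None

def insert(root, val, dups_left):
    """Insert val into the BST; ties go left if dups_left, else right."""
    if root is None:
        return Node(val)
    if val < root.val or (val == root.val and dups_left):
        root.left = insert(root.left, val, dups_left)
    else:
        root.right = insert(root.right, val, dups_left)
    return root

def build(nums, dups_left):
    root = None
    for n in nums:
        root = insert(root, n, dups_left)
    return root

def inorder(root):
    return [] if root is None else inorder(root.left) + [root.val] + inorder(root.right)

def shape(root):
    """Structural fingerprint: distinguishes trees even when inorders agree."""
    return None if root is None else (root.val, shape(root.left), shape(root.right))

a = build([5, 3, 5], dups_left=True)
b = build([5, 3, 5], dups_left=False)
assert inorder(a) == inorder(b) == [3, 5, 5]  # both are correct BSTs of the input
assert shape(a) != shape(b)                   # but their structures differ
```

A student test that pins down one of these two shapes would fail the other, perfectly correct, implementation.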
So what they suggest instead is that in your reference implementation, every time you encounter an ambiguous case, you do both things, creating multiple different reference implementations. Then you accept only student test cases where all your reference implementations agree with the expected output. They called this “AlgMult”.
Except that I don’t think this helps much. You’re still potentially grading students on things that were never in the assignment spec, and all you’ve done is to shift the work from writing thorough test cases to writing a thorough set of reference implementations, which is arguably even more work.
And all for what? For the sake of grading your students harder? That isn’t just a theoretical concern: they tested this on real student assignments, and on every assignment AlgMult failed an equal or greater number of students. In one case it went from labeling 28% of student assignments as faulty to 57%. Over double.
So I don’t see the value in this. It doesn’t look like it saves work for the instructor, and it could easily still end up with students being penalized for bullshit reasons that you never explicitly communicated. There are better ways to fail students.