It all averages out in the end

A discussion elsewhere has sparked a tangential thought on grading. Something that’s been bothering me for years. I think rather than pound my walking stick on the table here (as I am wont to do), or drift the thread in the wrong direction there, I’ll just propose it as a series of leading questions and biased examples.

Sup­pose you’re the instruc­tor of a class of 100 under­grad­u­ate stu­dents. Your goal is to give them grades. That’s because the Uni­ver­sity demands a sin­gle grade in the end. And because it’s con­ve­nient for you to just write down one num­ber and be done. And because that’s the way they’ve always done it, and what was good enough for you… OK, sorry, that’s the walk­ing stick.

Sup­pose for the sake of my pretty pic­tures that you’ve given your stu­dents two big graded assign­ments: a 40-​​page research paper that they had four weeks to com­plete, which was graded on com­pre­hen­sive­ness, insight, and clar­ity [but given one grade — oops]; and a final exam which they took in a cold win­ter lec­ture hall in an unfa­mil­iar part of cam­pus, which focused for two har­ried hours on their mas­tery of domain knowl­edge. Of course these two [or four] tasks demand quite dif­fer­ent skills, and occurred on dif­fer­ent days, and under dif­fer­ent stress lev­els and social con­di­tions. But you’re a grader, see, and you make grades. It’s what you’re for. So here are their grades:

(These are just ran­dom num­bers I’ve picked inde­pen­dently from two nor­mal dis­tri­b­u­tions with dif­fer­ent means and stan­dard devi­a­tions, and trun­cated at 0 and 100. You can play with it later. I’ve had grade books that looked some­thing like this, so I’ll pre­sume this is a real­is­tic enough model for our pur­poses. I’m also elid­ing the fact that these are sam­ples; maybe I should have said that the first is the sum of 20 home­work grades, and the sec­ond the sum of 10 oral pre­sen­ta­tions. That’s for another day….)
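If you want to play along, here is a minimal sketch of how numbers like these could be generated. The means, standard deviations, and seeds below are invented placeholders (the values behind the figures aren't given), and it assumes "truncated" means clipping out-of-range draws to 0 or 100 rather than redrawing them.

```python
import random

def make_grades(n=100, mean=70.0, sd=15.0, seed=None):
    """Draw n scores from a normal distribution, clipped to [0, 100]."""
    rng = random.Random(seed)
    return [min(100.0, max(0.0, rng.gauss(mean, sd))) for _ in range(n)]

# Hypothetical parameters, chosen only so the two tasks have different
# means and spreads; they are NOT the values used for the original plots.
paper = make_grades(mean=75.0, sd=10.0, seed=1)   # g1: the research paper
final = make_grades(mean=60.0, sd=20.0, seed=2)   # g2: the final exam

print(min(paper), max(paper))
print(min(final), max(final))
```

Separate seeds make the two lists independent draws, which matches the "picked independently" description.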

Decem­ber 22, and grades are due. And your grant pro­posal is over­due! Shit. Who gets an A, and who gets an F?

You know the answer, of course: You add the points up, and look at quan­tiles of some sort. Duh.

But hey, some instruc­tors like to game this a bit. They sense that a point on the big paper is worth more than a point on the final, so they weight that project a bit more heav­ily. The result­ing final score is, say (2*g1 + g2). But wait! In the phys­i­ol­ogy and organic chem­istry classes I suf­fered through as a wee child, the “words” part was entirely dis­re­garded, and the point of the class was how much you recalled. I’d say that final grade was more like (g1 + 9*g2).

Hmmm. Oh, man, you’re never gonna get that pro­posal done at this rate. So dif­fer­ent stu­dents apply dif­fer­ent lev­els of abil­ity to the tasks, which are them­selves notice­ably dif­fer­ent (look at the vari­ance). And dif­fer­ent instruc­tors apply dif­fer­ent affine com­bi­na­tions of these mul­ti­ple grades to come up with a sin­gle score? And don’t for­get that the Uni­ver­sity is aver­ag­ing those scores, and hand­ing out mil­lions of dollars.

(0.67*g1 + 0.33*g2) and (0.1*g1 + 0.9*g2), which are the two weightings above rescaled so the weights sum to 1, give very different rank-orderings. Let’s look at how the students in our example class fare over the range of all possible such affine combinations. That is, from the touchy-feely instructor who gives them a grade based on 100% of the paper and 0% of the final, to the Memory Mistress who throws the wordy paper away and just keeps the facts, ma’am.

Here we go:

Put the grant appli­ca­tion away and get your grade­book out. Your sim­ple choice of weights can make a huge dif­fer­ence in the result­ing grade of the stu­dents. By select­ing one of these many affine com­bi­na­tions — even the “add them together” one that is (0.5,0.5) — you impose a whole slew of your assump­tions on the ranks of the stu­dents. Peo­ple who aced the exam can still come in at the mid­dle of the range (or the bot­tom; wit­ness the two extremal lines on the upper left and the lower right of the spaghetti plot, which are peo­ple on the extremes).
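The sweep behind a spaghetti plot like that can be sketched roughly as follows. The grade distributions here are made up for illustration (they are not the data behind the figure), and `ranks`, `rank_sweep`, and `swing` are names invented for this sketch.

```python
import random

rng = random.Random(0)
clip = lambda x: min(100.0, max(0.0, x))
# Hypothetical class of 100: paper (g1) and final (g2), drawn independently.
paper = [clip(rng.gauss(75, 10)) for _ in range(100)]
final = [clip(rng.gauss(60, 20)) for _ in range(100)]

def ranks(scores):
    """Rank 1 = highest score; ties are broken by student index."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for place, i in enumerate(order, start=1):
        r[i] = place
    return r

def rank_sweep(g1, g2, steps=11):
    """Ranks under w*g1 + (1-w)*g2 for w evenly spaced from 0 to 1."""
    out = {}
    for k in range(steps):
        w = k / (steps - 1)
        combo = [w * a + (1 - w) * b for a, b in zip(g1, g2)]
        out[w] = ranks(combo)
    return out

sweep = rank_sweep(paper, final)

# How many places does each student move across the whole range of weights?
swing = [max(r[i] for r in sweep.values()) - min(r[i] for r in sweep.values())
         for i in range(len(paper))]
print("largest rank swing for any one student:", max(swing))
```

Plotting each student's rank against `w` gives one strand of spaghetti per student; `swing` is just the vertical extent of each strand.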

OK. Time for the walking-​​stick now. I’ll not bran­dish it (I’ll save that for the next time I’m forced to review your ridicu­lous man­u­script, or sit in your audi­ence at a con­fer­ence), but just gen­tly indi­cate where I’ve leaned it against the table, with a glance and a sub­tly raised eye­brow. This par­tic­u­lar walk­ing stick rep­re­sents my defense against the mind­less, habit­ual, prej­u­di­cial igno­rance that dri­ves peo­ple — engi­neers, sci­en­tists, instruc­tors, bankers, and frankly any­body who uses num­bers to make deci­sions — to just add shit together. Where “shit” is meant to imply apples and oranges, or risk and return, or cost and safety, or effi­cacy and tox­i­c­ity… or writ­ing abil­ity and mem­o­riza­tion skill.

That’s tacit prejudice. Not “expertise”, which is what a lot of righteous people claim when they are challenged; it would only be expertise if there weren’t another way to do it. You think you know writing is 2/3 as important as the final? And you know the final demanded exactly four times as much domain knowledge as the quizzes? And you know that each student will need that exact, blessed ratio to be successful in life? Bullshit.

OK, I’m putting the stick down. Sorry.

Let’s review: You want to give the high­est grade to the peo­ple who simul­ta­ne­ously did the best on a num­ber of dif­fer­ent mea­sured scores. But there are some peo­ple who do well on one task, and poorly on oth­ers. As long as these tasks are not repeated sam­ples of the same ran­dom vari­able, you can’t jus­tify any par­tic­u­lar affine com­bi­na­tion of them with­out invok­ing your imag­i­nary tacit knowl­edge of “how it should be”. The order of the stu­dents changes a lot in that spaghetti graph up there. How cer­tain are you of your one choice along the x-​​axis?

We learn single-objective (scalar) optimization first, and rarely ever hear about multiobjective (vector) optimization. This is justified by saying that the intuition and tools developed for scalar optimization “just scale naturally” to vectors, but (a) that’s a lie, (b) scalar optimization is thought to be “easier”, and (c) it was invented first, so more people have heard about it. Prejudice, in other words. Scalar optimization is a special case of multi-objective, and a weird one at that.

Look at how stupid you have to sound to be able to communicate the simple idea of dominance: Examine a list of scalar (single-task) grades. The order is obvious, right? Best is highest, worst is lowest, and there are some rare ties. But the order can also be stated this way: “the best sample is the one such that the fewest other samples have larger values, and the worst is the sample such that the most other samples have higher values.”

Doesn’t that sound dumb? But that’s how we talk about dom­i­nance in mul­ti­ple dimen­sions. The best sam­ple is non-​​dominated, and the worst sam­ple is dom­i­nated by the most others.

Just as in the scalar case, in mul­ti­ob­jec­tive sit­u­a­tions, we say that one sam­ple strictly dom­i­nates another when it is simul­ta­ne­ously higher on the basis of all objec­tives. See how that maps to one task? But it also sounds like our grad­ing issue, doesn’t it? “You want to give the high­est grade to the peo­ple who simul­ta­ne­ously did the best on a num­ber of dif­fer­ent mea­sured scores.”

Let’s revisit our students. The grade of A goes to the students whose scores are dominated by the fewest of their peers; these are the ones at the edges of the upper right of Figure 1 above. Notice that the student at the highest paper grade, and the one at the highest final exam grade will both get As under this system. Note also that this order does not depend on the linear scale of the axes: you can multiply the scores by whatever positive numbers you want, and you’ll still end up with exactly the same rank-ordering.
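To make the counting concrete, here is a small sketch with five invented students, each a (paper, final) pair. The function name `dominated_by` and the data are illustrative assumptions; the rule itself is the strict dominance described above, where another student must be higher on every score at once.

```python
def dominated_by(points):
    """For each point, count how many other points beat it on EVERY objective."""
    return [
        sum(all(q_k > p_k for q_k, p_k in zip(q, p)) for q in points)
        for p in points
    ]

# Five made-up students: (paper grade, final exam grade).
students = [(90, 40), (50, 95), (60, 60), (40, 30), (85, 35)]
counts = dominated_by(students)
print(counts)  # [0, 0, 0, 4, 1]

# Rescale each axis by a different positive factor, e.g. the (2, 9) weights
# of two different instructors: the dominance counts do not change.
rescaled = [(2 * p, 9 * f) for p, f in students]
print(dominated_by(rescaled) == counts)  # True
```

The top paper writer, the top exam taker, and the balanced (60, 60) student are all non-dominated, so all three sit at the head of the order, whatever positive weights you had in mind.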

That sounds all weird. Sounds like the result will be peo­ple get­ting high grades who don’t deserve them, doesn’t it?

Deserve them on what basis? On your pre­sump­tu­ous affine com­bi­na­tion basis, is what you mean. Go look at Fig­ure 2 again, when­ever you think your favorite affine com­bi­na­tion is best, and just notice that all the choices peo­ple tend to use (towards the mid­dle) are in the region where the strict order of stu­dents is chang­ing most rapidly. A slight error or devi­a­tion to a choice made in that region can cause may­hem in the exact orders. How does that make you feel? More confident?

But here’s a lifeline: Let’s order the students according to the good old add-’em-up (0.5,0.5) affine combination, and plot that score against a ranking based on the number of other students whose grades strictly dominate theirs. (What I’ve done is subtract the number of students that dominate from 100, so both ranges are 0–100, increasing.) Here:

The horizontal axis is the multiobjective rank-order. The ones over towards the right are beaten on all tasks simultaneously by the fewest of their peers, and the ones over towards the left are the worst at everything simultaneously (again, this is 100% independent of the relative weights). And the vertical axis is the traditional 50/50 averaged score. People near the top of that range are the ones that scored the most points overall, and people on the bottom are the ones who scored the fewest overall.
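Both axes are cheap to compute. A minimal sketch, assuming a made-up class of 100 with independently drawn grades (not the data behind the plot); `dominance_score` and `pearson` are names invented here, and Pearson correlation is just one convenient way to summarize how closely the two scores track each other.

```python
import random

rng = random.Random(0)
clip = lambda x: min(100.0, max(0.0, x))
paper = [clip(rng.gauss(75, 10)) for _ in range(100)]  # hypothetical g1
final = [clip(rng.gauss(60, 20)) for _ in range(100)]  # hypothetical g2

def dominance_score(g1, g2):
    """n minus the number of students who are strictly higher on both tasks."""
    n = len(g1)
    return [n - sum(1 for j in range(n) if g1[j] > g1[i] and g2[j] > g2[i])
            for i in range(n)]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

horizontal = dominance_score(paper, final)                  # multiobjective score
vertical = [0.5 * a + 0.5 * b for a, b in zip(paper, final)]  # 50/50 average

print("correlation between the two scores:", round(pearson(horizontal, vertical), 2))
```

With 100 students the dominance score runs from 1 to 100, and every non-dominated student gets the full 100.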

Personally, I kind of like it. These two ranks are strongly correlated, which implies that this approach shouldn’t violate too many of your treasured assumptions. But it also manages to allow students to excel at their strengths, and forgives them for their weaknesses. The “troubling” part might be that the traditionally “worst” students tend to get the biggest boosts. There’s that one, down in the lower right, who would surely have failed the traditionally-graded class but who is in the upper ranges in the multiobjective ranks. Why is that?

It’s because I picked the data so the two tasks (paper and final) are uncorrelated with one another. What would happen if they were strongly correlated? Draw it and see; the more correlated the multiple grades, the closer the ranks will be to the traditional average. But what if they are more strongly anticorrelated? Here are the students’ grades for an oral presentation plotted against their paper grade (same ones as above):

You still want to add these two scores up? Even though it’s clear that peo­ple who do well on one assign­ment tend to do worse on the other? Eye­balling it, it seems to me that the highest-​​scoring per­son on the pre­sen­ta­tion is going to get a below-​​median grade that way. You want that?

You’re con­fi­dent that pre­sen­ta­tion skills are exactly equiv­a­lent to final exam skills? I see. So that’s why you always put writ­ing your talk off to the last minute.
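If you'd rather run the "draw it and see" experiment than take anyone's word for it, here is one way to sketch it. Everything specific is an assumption made for illustration: the means and spreads, the seed, the three correlation values, and the helper names. Correlated scores are built with the standard trick of mixing two independent normal draws.

```python
import math
import random

def correlated_normals(n, rho, rng):
    """n pairs of standard normal draws with population correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((z1, rho * z1 + math.sqrt(1 - rho * rho) * z2))
    return pairs

def clip(x):
    return min(100.0, max(0.0, x))

def make_class(n, rho, seed):
    """Hypothetical paper and presentation grades with correlation near rho."""
    zs = correlated_normals(n, rho, random.Random(seed))
    paper = [clip(75 + 10 * a) for a, _ in zs]
    talk = [clip(60 + 20 * b) for _, b in zs]
    return paper, talk

def dominance_score(g1, g2):
    n = len(g1)
    return [n - sum(1 for j in range(n) if g1[j] > g1[i] and g2[j] > g2[i])
            for i in range(n)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Guard against a degenerate case where one score has no spread at all.
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else float("nan")

for rho in (-0.8, 0.0, 0.8):
    paper, talk = make_class(100, rho, seed=0)
    dom = dominance_score(paper, talk)
    avg = [0.5 * a + 0.5 * b for a, b in zip(paper, talk)]
    print(f"rho={rho:+.1f}  agreement with 50/50 average: {pearson(dom, avg):.2f}")
```

Try other seeds and class sizes before trusting any single run; one draw of 100 students is itself just a sample.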

No, really, let’s talk about correlation for a minute. Let’s suppose that the big components we’re talking about here are qualitatively different tasks: quizzes and homeworks and speaking and singing and doing proofs and writing programs and explaining things and doing library research and writing well and making precision parts, say. If you plot some pair of these, and there’s a huge correlation in the students’ actual scores, what does that mean? If they’re anticorrelated, what does that mean? About you, and about your students, and about your “testing instruments”? About differential ability on both sides of the line?

As a decent instruc­tor, you will give mul­ti­ple assign­ments to your stu­dents. Whether you acknowl­edge it or not, the grades on those assign­ments will not only be sub­ject to vari­abil­ity, but the assign­ments them­selves will be dif­fer­ent. Some assign­ments will inevitably stress dif­fer­ent skills. Your stu­dents have het­ero­ge­neous abil­ity and apti­tude lev­els at those mul­ti­ple skills.

Which of their skills is bet­ter? How much? And who are you to choose?

That last is not a ques­tion to offend. That’s a ques­tion about the nature of decisionmaking.