# more on that “third language”

I am pleased to see my friend Cosma Shalizi writing at length, and from a substantively different domain, about the sense I was trying to express earlier today of personal and cultural tensions between the “language of answers”, the “language of looking for answers”, and the “language of asking questions when presented with answers”.

If you squint, I mean. Statistical models are “answers”. Statistical practice is “looking”. The desire to model is itself “asking”, and the more I look around at statistics (especially machine learning and genetic programming), operations research, and the other heavily mathematized fields of engineering endeavor—up to and including programming—the more I realize how poorly and infrequently we nerds talk about why and how.

What we need is some kind of, I dunno, criticism….

# What does this remind you of, Nerdy?

This is just a telegraphic note to get this out of my head and onto paper quickly.

I’m thinking about crowdsourced proofreading, or recommender systems, or any of several other problems in bipartite network theory (to get all nerdy and crap). In general, suppose there is a collection of $M$ people, and each of them is looking at some subset of $N$ objects.

If they’re proofreaders looking at transcribed pictures of pages and checking for errors, then when person $m_i$ checks and signs off on a page $n_j$, we connect node $m_i$ to $n_j$ with an edge.

If they’re web surfers looking at sites and tweets and shit and checking for awesomeness, then when person $m_i$ visits and recommends a URI $n_j$, we connect node $m_i$ to $n_j$ with an edge.

And so on.

We all know a lot of different ways of scoring the things these people are looking at, don’t we, Nerdy? PageRank, a bunch of network theory things, &c &c.

Me, I’m musing about something slightly different. I think. I’m thinking about confirmation, and collaboration, and how the step from one to two people vouching for something is “worth” more than the step from 17 to 18 people vouching for it.

Let’s score things this way:

1. For each person, observe the set of objects to which they’re linked. Suppose for example that Ada linked to items #1, #2 and #3, Byron linked to item #2, and Charles linked to item #2.
2. Similarly, for each item, collect the set of people who are linked to it. In our example, item #1 is linked from Ada; item #2 from Ada, Byron and Charles; and item #3 from Ada.
3. The score of a person is increased by 1 for every item they link to whose audience (the subset of people linking to it) is distinct among that person’s items. In our example, Ada links to item #1 (worth 1 point), item #2 (worth another point, because $\{A\} \neq \{A, B, C\}$), and item #3 (not worth more points, since item #1 is linked from the same subset of people). Byron’s score is 1 because he links to item #2; so is Charles’s.
4. The score of an item is [some statistic of] the scores of the people linked to it. Maybe average, maybe sum; doesn’t really matter to me just now.

The interesting thing to me at the moment is understanding the dynamics of discovery as a game here. Ada hasn’t got any way to increase her score without discovering some item #4. If Byron links to item #1 or #3, he increases his score, and also Ada’s score indirectly, because she will then link to three items with different “audiences”. But if Charles follows suit, and links to the same item Byron does, that shared advantage disappears for both of them. If everybody links to everything, the scores are eroded away to 1 across the board.
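A minimal sketch of this scoring scheme, using the Ada/Byron/Charles example. The function names and data layout are my own assumptions, not anything fixed by the sketch above:

```python
def audiences(links):
    """Map each item to the frozenset of people linked to it."""
    aud = {}
    for person, items in links.items():
        for item in items:
            aud.setdefault(item, set()).add(person)
    return {item: frozenset(people) for item, people in aud.items()}

def person_score(links, person):
    """Rule 3: count the distinct audience subsets among this person's items."""
    aud = audiences(links)
    return len({aud[item] for item in links[person]})

def item_score(links, item):
    """Rule 4: here, average the scores of the people linked to the item."""
    people = audiences(links)[item]
    return sum(person_score(links, p) for p in people) / len(people)

links = {"Ada": {1, 2, 3}, "Byron": {2}, "Charles": {2}}
# Ada's items have audiences {A}, {A,B,C}, {A}: two distinct subsets.
print(person_score(links, "Ada"))    # 2
print(person_score(links, "Byron"))  # 1

# If Byron also links item #1, both he and Ada gain...
links["Byron"].add(1)
print(person_score(links, "Ada"))    # 3
print(person_score(links, "Byron"))  # 2

# ...but if Charles follows suit, the shared advantage erodes again.
links["Charles"].add(1)
print(person_score(links, "Ada"))    # 2
print(person_score(links, "Byron"))  # 1
```

Running through the game dynamics this way makes the erosion concrete: once Charles copies Byron’s discovery, items #1 and #2 have the same audience and both men fall back to where they started.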

So what problem am I trying to solve here?

In the antiquated crowdsourcing system of Distributed Proofreaders, there have been numerous serious issues with quality and diversity through the years. Originally proofreaders’ “scores” were just the number of pages they read and signed off as being “correct”; this of course led to click-through gaming that didn’t actually improve the quality of texts. But there’s still a lot of money on the table there, since I often see books in which a dozen consecutive pages have been proofread by the same pair of people.

This, to me, is a concern. Because of Scott Page’s work, among other things.

There’s a Goldilocks point here. If everybody does non-overlapping work, there is no community, and no chance for checking one another’s work. If everybody does everything, there’s a lot of redundancy in the work, and returns diminish quickly. But somewhere in between is a problem of sub-community structure: if two people consistently apply themselves to the same work, there must be an accompanying diminishment in their joint contribution.

What I’m trying to do here is reward coverage.

Similarly, I’m concerned with the sustainable quality of links generated by crowdsourced systems like delicious.com or pinboard.in—or even Google PageRank—as bots spew random-seeming links across the network. At the moment I have a search window open for “genetic programming” on Twitter, and seventeen of the eighteen hits are bots that simply link to Amazon affiliate books.

So anyway: just thinking.

If I had a collection of $M$ players and $N$ items, what strategy for linking to items would maximize the score of an individual player? The average score of the entire collection?

Later: Of course, I’ve done a thing I usually yell at other people for doing. I have several goals in mind, and I may be conflating one or two of them.

Let me talk about it in terms of aggregated system design goals, rather than individual game dynamics for the “people” in my sketched model: We agree we want to maximize the number of links between people and things (in the two examples I’ve mentioned, and in others like maybe political engagement, or club membership). The goal of “confirmation” means we also want to maximize the number of people looking at each item, so that we have many eyes looking at each thing at once.

My fillip here is that I’m suggesting we should simultaneously try to maximize the diversity of the subsets of people who have looked at items, to reduce the correlation between people’s attention to items as much as possible.
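One crude way to operationalize that third goal (my own sketch, not a standard measure): score a whole link network by the fraction of items whose audience subset is distinct.

```python
def audience_diversity(links):
    """Fraction of items whose audience (set of people linked to it) is distinct."""
    aud = {}
    for person, items in links.items():
        for item in items:
            aud.setdefault(item, set()).add(person)
    distinct = {frozenset(people) for people in aud.values()}
    return len(distinct) / len(aud)

# In the Ada/Byron/Charles example, items #1 and #3 share the audience {Ada},
# so only two of the three audience subsets are distinct.
print(audience_diversity({"Ada": {1, 2, 3}, "Byron": {2}, "Charles": {2}}))  # 2/3
```

This hits 1.0 when every item has a different audience, and collapses toward the minimum when everybody looks at everything: exactly the correlation of attention I want to penalize.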

# for the prettiest ones

A lively breakfast conversation with old friend Cosma Shalizi this morning leaves me hanging on two or three technical confusions. Basically I buttonholed him and made him buy me breakfast today because I’m trying to understand his people better.

Statisticians, I mean.

In particular, I wanted to talk at him a while about what I see and hear in his student’s recent research project.

Some headway was made. But the time and effort spent translating from the [whatever the fuck it is I speak] to a series of relevant and semantically reasonable-sounding noises meant I didn’t quite get far enough to close all the open questions I had.

So I’m just going to dump them here, in more or less unedited form.

File this all under either “notes to self” or “cramped rants scribbled on ceiling of patient’s cell”, as you prefer.

### For the prettiest ones

I didn’t get to a basic question about statistics that’s been developing in my practice lately. Viz: when are they necessary, and when are they even a good idea?

By “statistics” I don’t mean Statistics—the field—but rather the mathematical functions we use to summarize sets of observations: mean, variance, maximum, mode, range, and so forth.

Suppose you’re modeling some dataset containing $e$ example training points, using a model

$\mathbf{M}: \hat{y} = \hat{f}(x)$, where

$x \in \Re^{m}, y \in \Re$

Applying some particular model $\mathbf{M}_1$ to the dataset produces a vector of predictions $\hat{y}_1 \in \Re^{e}$, which we want to compare to our expected responses $\overrightarrow{y} \in \Re^{e}$. And given some other model $\mathbf{M}_2$, we’ll get a different vector of predicted values (one for each training input), $\hat{y}_2 \in \Re^{e}$.

Now normally we drop from this big honking $e$-dimensional space into one (1) dimension by doing some statistics at this point. Adding up the absolute deviations, or taking the residual sum of squares: all those standard measures of model fit. Those functions of random variables are “statistics”.
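Concretely, with made-up numbers: the residual vector lives in $\Re^e$, and each of these summary functions collapses it to a single scalar.

```python
y     = [1.0, 2.0, 3.0, 4.0]   # expected responses (made-up numbers)
yhat1 = [1.1, 1.9, 3.2, 4.1]   # one model's predictions, one per training case

residuals = [p - t for p, t in zip(yhat1, y)]          # still e-dimensional

mad = sum(abs(r) for r in residuals) / len(residuals)  # mean absolute deviation
rss = sum(r * r for r in residuals)                    # residual sum of squares
print(mad, rss)  # each one is a single scalar "statistic"
```

The information thrown away in that collapse, per-training-case fit, is exactly what the rest of this post is about keeping.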

As I understand it from the sidelines, in the Statistics community (note the big ‘S’) there’s a lot of time spent working out modes and manners for model selection. These amount to comparison metrics which project the vectors I’ve described into various lower-dimensional spaces, with accompanying caveats about special conditions under which the assumptions underlying a given projection may fail, or be questionable. Particular kinds of models, particular biases in the datasets, and unusual interactions between the two.

Now suppose, for the sake of building up this argument slowly, that we have one (1) input:output pair in our training set; one vector of $\overrightarrow{x}$ values, in a tuple with a single expected $\overrightarrow{y} \in \Re^1$ value. Given two models $\mathbf{M}_1$ and $\mathbf{M}_2$, if I plug those $\overrightarrow{x}$ values in I’ll get predictions $\hat{y}_1 = (\hat{y}_{1,1})$ and $\hat{y}_2 = (\hat{y}_{2,1})$.

Clearly, the better of these two models is the one whose distance from the expected $\overrightarrow{y}$ is smallest. In one dimension, it doesn’t really matter which of the metric norms you prefer; you can use absolute difference and be intuitive about “distance in one dimension”, or square the distances if you want to [prematurely] plan ahead for higher-dimensional cases.

In this trivial case with one training point, let’s just say we pick the $L_1$ norm—absolute value—and we’ll stick with it as we move up the dimensions. OK?

Now suppose we have two (2) input:output pairs in our training set; two tuples of $\overrightarrow{x}$ values and accompanying expected $y$ values. Given our two models $\mathbf{M}_1$ and $\mathbf{M}_2$, when I plug those two sets of $x$ values in I’ll get predictions $\hat{y}_1 = (\hat{y}_{1,1},\hat{y}_{1,2}) \in \Re^2$ and $\hat{y}_2 = (\hat{y}_{2,1},\hat{y}_{2,2}) \in \Re^2$.

Let’s just dismiss Doktor Gauss for a while. Send him out of the room.

When I say that $\hat{y}_1$ dominates $\hat{y}_2$ under some metric $\mathbf{L}$ against $\overrightarrow{y}$, I mean that for every training case in our dataset, $\mathbf{L}(\hat{y}_{1,i},\overrightarrow{y}_i) \leq \mathbf{L}(\hat{y}_{2,i},\overrightarrow{y}_i)$, and for at least one training case, $\mathbf{L}(\hat{y}_{1,i},\overrightarrow{y}_i) < \mathbf{L}(\hat{y}_{2,i},\overrightarrow{y}_i)$.

Notice that for any norm, the ordering of observations doesn’t actually change. We’re not aggregating over all training cases; we’re producing vectors whose components can be stretched or squared—so long as you do it independently across training cases—without changing the domination relationship. If some model $\mathbf{M}_1$ dominates some $\mathbf{M}_2$ given the training data, it doesn’t matter whether we square the residuals, or take their absolute values, or whatever. As long as we don’t aggregate them.
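A small sketch of that dominance check (the names and numbers are mine), plus the invariance claim: squaring each case’s absolute error independently can’t flip the relation, since $x \mapsto x^2$ is monotone on the nonnegatives.

```python
def dominates(err_a, err_b):
    """True if error vector err_a Pareto-dominates err_b:
    no worse on every training case, strictly better on at least one."""
    return (all(a <= b for a, b in zip(err_a, err_b))
            and any(a < b for a, b in zip(err_a, err_b)))

# per-case absolute errors |yhat_i - y_i| for two hypothetical models
e1 = [0.5, 1.0, 0.2]
e2 = [0.6, 1.0, 0.9]
print(dominates(e1, e2))                                  # True
print(dominates([a**2 for a in e1], [b**2 for b in e2]))  # still True after squaring
```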

In other words, I’m wondering whether it’s possible to keep “fit with regard to a given training tuple” separate, and use standard multi-objective ranking methods to discriminate models that are dominated from those that are non-dominated, given a particular training set.

Multicriterion sorting is a partial ordering relationship, and that means there are often—especially in higher dimensions—mutually nondominated points. But that doesn’t make it different in kind from traditional scalar ordering, which is just a degenerate case: there are ties in races, after all.
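The ranking step itself, sketched over a toy population of per-case error vectors (again my own example), showing mutually nondominated survivors:

```python
def dominates(err_a, err_b):
    """No worse on every case, strictly better on at least one."""
    return (all(a <= b for a, b in zip(err_a, err_b))
            and any(a < b for a, b in zip(err_a, err_b)))

def nondominated(errors):
    """Indices of the models not dominated by any other model."""
    return [i for i, e in enumerate(errors)
            if not any(dominates(o, e) for j, o in enumerate(errors) if j != i)]

errors = [
    [0.1, 0.9],  # specialist for training case 1
    [0.9, 0.1],  # specialist for training case 2
    [0.4, 0.4],  # generalist
    [0.5, 0.5],  # dominated by the generalist
]
print(nondominated(errors))  # [0, 1, 2]
```

The two specialists and the generalist are mutually nondominated, just the sort of “ties in races” the partial ordering permits.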

So what does holding back on aggregation do?

Well, for one thing, it brings into focus the notion of broad and wide-ranging families of models, not the scant handful many statisticians are used to working with.

It leads to some interesting possibilities in understanding the relationship between model performance over test data (generalization) and over training data.

It leads to the useful notion of sheaves of multiple nondominated models, some “specialized” for modeling one training point, others specialized in modeling other data points, and a few generalists that do quite well on all of them.

It opens up some interesting questions (to me, at least) about leave-one-out validation methods. Especially about how robust rankings might be.

Finally, it seems that it opens a door for the sort of “data balancing” work Katya Vladislavleva has made such progress with… and maybe a direction through which it can be communicated to the folks on the machine learning side.

We’ll see.