# for the prettiest ones

A lively breakfast conversation with old friend Cosma Shalizi this morning leaves me hanging on two or three technical confusions. Basically I buttonholed him and made him buy me breakfast today because I’m trying to understand his people better.

Statisticians, I mean.

In particular, I wanted to talk at him a while about what I see and hear in his student’s recent research project.

Some headway was made. But the time and effort spent translating from [whatever the fuck it is I speak] into a series of relevant and semantically reasonable-sounding noises meant I didn’t quite get far enough to close all the open questions I had.

So I’m just going to dump them here. In more or less unedited form.

File this all under either “notes to self” or “cramped rants scribbled on ceiling of patient’s cell”, as you prefer.

### For the prettiest ones

I didn’t get to a basic question about statistics that’s been developing in my practice lately. Viz: when are they necessary, and when are they even a good idea?

By “statistics” I don’t mean Statistics (the field) but rather the mathematical functions we use to summarize sets of observations: mean, variance, maximum, mode, range, and so forth.

Suppose you’re modeling some dataset containing $e$ example training points, using a model

$\mathbf{M}: \hat{y} = \hat{f}(x)$, where

$x \in \Re^{m}, y \in \Re$

Applying some particular model $\mathbf{M}_1$ to the dataset produces a vector of predictions $\hat{y}_1 \in \Re^{e}$, which we want to compare to our expected responses $\overrightarrow{y} \in \Re^{e}$. And given some other model $\mathbf{M}_2$, we’ll get a different vector of predicted values (one for each training input), $\hat{y}_2 \in \Re^{e}$.

Now normally we drop from this big honking $e$-dimensional space into one (1) dimension by doing some statistics at this point: summing the absolute deviations, or taking the residual sum of squares, or any of those standard measures of model fit. Those functions of random variables are “statistics”.
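In code, that dimension-dropping step is just a pair of reductions. A minimal sketch; the function names and the toy numbers here are mine, purely illustrative:

```python
def sum_abs_dev(y_hat, y):
    """Sum of absolute deviations: one scalar from e residuals."""
    return sum(abs(p - t) for p, t in zip(y_hat, y))

def rss(y_hat, y):
    """Residual sum of squares: a different scalar from the same vector."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))

y      = [1.0, 2.0, 3.0]   # expected responses, e = 3
y_hat1 = [1.1, 2.0, 2.8]   # predictions from some model M1
y_hat2 = [0.5, 2.5, 3.0]   # predictions from some model M2

# Each statistic collapses an e-dimensional residual vector to a point:
print(sum_abs_dev(y_hat1, y), rss(y_hat1, y))
print(sum_abs_dev(y_hat2, y), rss(y_hat2, y))
```

Either statistic throws away the per-case structure: two models with quite different residual vectors can land on the same scalar.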

As I understand it from the sidelines, the Statistics community (note the big ‘S’) spends a lot of time working out modes and manners for model selection. These amount to comparison metrics which project the vectors I’ve described into various lower-dimensional spaces, with accompanying caveats about special conditions under which the assumptions underlying a given projection may fail, or be questionable: particular kinds of models, particular biases in the datasets, and unusual interactions between the two.

Now suppose, for the sake of building up this argument slowly, that we have one (1) input:output pair in our training set: one vector of $\overrightarrow{x}$ values, in a tuple with a single expected $\overrightarrow{y} \in \Re^1$ value. Given two models $\mathbf{M}_1$ and $\mathbf{M}_2$, if I plug those $\overrightarrow{x}$ values in I’ll get predictions $\hat{y}_1 = (\hat{y}_{1,1})$ and $\hat{y}_2 = (\hat{y}_{2,1})$.

Clearly, the better of these two models is the one whose distance from the expected $\overrightarrow{y}$ is smallest. In one dimension it doesn’t really matter which of the metric norms you prefer; you can use the absolute difference and be intuitive about “distance in one dimension”, or square the distances if you want to [prematurely] plan ahead for higher-dimensional cases.

In this trivial case with one training point, let’s just say we pick the $L_1$ norm (absolute value) and we’ll stick with it as we move up the dimensions. OK?

Now suppose we have two (2) input:output pairs in our training set; two tuples of $\overrightarrow{x}$ values and accompanying expected $\overrightarrow{y}$ values. Given our two models $\mathbf{M}_1$ and $\mathbf{M}_2$, when I plug those two sets of $x$ values in I’ll get predictions $\hat{y}_1 = (\hat{y}_{1,1},\hat{y}_{1,2}) \in \Re^2$ and $\hat{y}_2 = (\hat{y}_{2,1},\hat{y}_{2,2}) \in \Re^2$.

Let’s just dismiss Doktor Gauss for a while. Send him out of the room.

When I say that $\hat{y}_1$ dominates $\hat{y}_2$ under some metric $\mathbf{L}$ against $\overrightarrow{y}$, I mean that for every training case in our dataset, $\mathbf{L}(\hat{y}_{1,i},\overrightarrow{y}_i) \leq \mathbf{L}(\hat{y}_{2,i},\overrightarrow{y}_i)$, and for at least one training case, $\mathbf{L}(\hat{y}_{1,i},\overrightarrow{y}_i) < \mathbf{L}(\hat{y}_{2,i},\overrightarrow{y}_i)$.
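That definition translates directly into a dominance check over per-case error vectors. A sketch, where `dominates` and the toy error values are my own illustrative choices:

```python
def dominates(err_a, err_b):
    """True if err_a Pareto-dominates err_b: no worse on every
    training case, and strictly better on at least one."""
    no_worse = all(a <= b for a, b in zip(err_a, err_b))
    strictly_better = any(a < b for a, b in zip(err_a, err_b))
    return no_worse and strictly_better

# Per-case errors L(y_hat_{k,i}, y_i) for two models over two cases:
err1 = [0.1, 0.3]
err2 = [0.2, 0.3]

print(dominates(err1, err2))  # True: tied on case 2, better on case 1
print(dominates(err2, err1))  # False
print(dominates(err1, err1))  # False: a vector never dominates itself
```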

Notice that this domination relationship doesn’t depend on which norm you pick. We’re not aggregating the per-case errors over all training cases; we’re producing vectors whose entries can be stretched or squared (as long as you do it independently and monotonically across training cases) without changing the domination relationship. If some model $\mathbf{M}_1$ dominates some $\mathbf{M}_2$ given the training data, it doesn’t matter whether we square the residuals, or take their absolute values, or whatever. As long as we don’t aggregate them.
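To see that invariance concretely: squaring is strictly increasing on nonnegative errors, so applying it per training case can’t flip the relation. A toy check, numbers invented for illustration:

```python
def dominates(err_a, err_b):
    """No worse on every training case, strictly better on at least one."""
    return (all(a <= b for a, b in zip(err_a, err_b))
            and any(a < b for a, b in zip(err_a, err_b)))

abs_err1 = [0.1, 0.3]                  # per-case absolute errors, model 1
abs_err2 = [0.2, 0.4]                  # per-case absolute errors, model 2
sq_err1 = [e ** 2 for e in abs_err1]   # same cases, squared
sq_err2 = [e ** 2 for e in abs_err2]

print(dominates(abs_err1, abs_err2))   # True
print(dominates(sq_err1, sq_err2))     # True: relation survives squaring
```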

In other words, I’m wondering whether it’s possible to keep “fit with regard to a given training tuple” separate, and use standard multi-objective ranking methods to discriminate models that are dominated from those that are non-dominated, given a particular training set.

Multicriterion sorting gives a partial ordering, and that means there are often (especially in higher dimensions) mutually nondominated points. But that doesn’t make it fundamentally different from traditional scalar ordering, which is just a degenerate case: there are ties in races, after all.
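Filtering a population of models down to its nondominated set is then only a few lines. A naive quadratic-time sketch, with made-up per-case error vectors:

```python
def dominates(err_a, err_b):
    """No worse on every training case, strictly better on at least one."""
    return (all(a <= b for a, b in zip(err_a, err_b))
            and any(a < b for a, b in zip(err_a, err_b)))

def nondominated(errors):
    """Indices of the models no other model dominates (the Pareto set)."""
    return [i for i, e in enumerate(errors)
            if not any(dominates(other, e)
                       for j, other in enumerate(errors) if j != i)]

# Per-case errors for four hypothetical models over two training cases:
errors = [
    [0.1, 0.9],   # does well on case 1, badly on case 2
    [0.9, 0.1],   # the reverse
    [0.4, 0.4],   # middling on both
    [0.5, 0.5],   # strictly worse than the model above: dominated
]
print(nondominated(errors))  # the first three are mutually nondominated
```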

So what does holding back on aggregation do?

Well, for one thing, it brings into focus the notion of broad and wide-ranging families of models, not the scant handful many statisticians are used to working with.

It leads to some interesting possibilities for understanding the relationship between model performance on test data (generalization) and on training data.

It leads to the useful notion of sheaves of multiple nondominated models, some “specialized” for modeling one training point, others specialized in modeling other data points, and a few generalists that do quite well on all of them.
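One crude way to surface that sheaf structure: for each training case, ask which model achieves the lowest error there. The model names and error numbers below are fabricated for illustration:

```python
# Per-case errors for three mutually nondominated models (invented data):
errors = {
    "specialist_1": [0.1, 0.9],   # shines on training case 0
    "specialist_2": [0.9, 0.1],   # shines on training case 1
    "generalist":   [0.4, 0.4],   # decent everywhere
}

n_cases = 2
# For each training case, the model with the smallest error there:
best_per_case = {i: min(errors, key=lambda m: errors[m][i])
                 for i in range(n_cases)}
print(best_per_case)
```

Note that the generalist never “owns” a case here, yet it is still nondominated; that’s exactly the kind of model an aggregate score might bury and a per-case ranking keeps visible.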

It opens up some interesting questions (to me, at least) about leave-one-out validation methods. Especially about how robust rankings might be.

Finally, it seems that it opens a door for the sort of “data balancing” work Katya Vladislavleva has made such progress with… and maybe a direction through which it can be communicated to the folks on the machine learning side.

We’ll see.