A lively breakfast conversation with old friend Cosma Shalizi this morning leaves me hanging on two or three technical confusions. Basically I buttonholed him and made him buy me breakfast today because I’m trying to understand his people better.
Statisticians, I mean.
In particular, I wanted to talk at him a while about what I see and hear in his student’s recent research project.
Some headway was made. But the time and effort spent translating from the [whatever the fuck it is I speak] to a series of relevant and semantically reasonable-sounding noises meant I didn’t quite get far enough to close all the open questions I had.
So I’m just going to dump them here. In more or less unedited form.
File this all under either “notes to self” or “cramped rants scribbled on ceiling of patient’s cell”, as you prefer.
For the prettiest ones
I didn’t get to a basic question about statistics that’s been developing in my practice lately. Viz: when are they necessary, and when are they even a good idea?
By “statistics” I don’t mean Statistics—the field—but rather the mathematical functions we use to summarize sets of observations: mean, variance, maximum, mode, range, and so forth.
Suppose you’re modeling some dataset containing N example training points, using a model f.

Applying that particular model f to the dataset produces a vector of predictions (f(x_1), …, f(x_N)), which we want to compare to our expected responses (y_1, …, y_N). And given some other model g, we’ll get a different vector of predicted values (one for each training input), (g(x_1), …, g(x_N)).

Now normally we drop from this big honking N-dimensional space into one (1) dimension by doing some statistics at this point. Adding the absolute deviations up, or taking the residual sum of squares: all those standard measures of model fit. Those functions of random variables are “statistics”.
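To make that collapse concrete, here’s a minimal sketch in Python. The data and the two models are entirely made up for illustration; the point is just that the residual vector has one entry per training case, and each familiar “statistic” projects it down to a single number.

```python
xs = [0.0, 1.0, 2.0, 3.0]        # hypothetical training inputs
ys = [0.1, 1.9, 4.2, 5.8]        # hypothetical expected responses

f = lambda x: 2.0 * x            # toy model f
g = lambda x: 2.0 * x - 0.5      # toy model g

def residuals(model, xs, ys):
    """One signed error per training case -- no aggregation yet."""
    return [model(x) - y for x, y in zip(xs, ys)]

r_f = residuals(f, xs, ys)
r_g = residuals(g, xs, ys)

# The usual statistics: each projects the N-dimensional residual
# vector down into one (1) dimension.
sum_abs_f = sum(abs(r) for r in r_f)   # sum of absolute deviations
sum_abs_g = sum(abs(r) for r in r_g)
rss_f = sum(r * r for r in r_f)        # residual sum of squares
rss_g = sum(r * r for r in r_g)
```

Once you’ve summed, the per-case information is gone: `sum_abs_f` says nothing about *which* training points f handled well.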
As I understand it from the sidelines, in the Statistics community (note the big ‘S’) there’s a lot of time spent working out modes and manners for model selection. These amount to comparison metrics which project these vectors I’ve described into various lower-dimensional spaces, with accompanying caveats about special conditions under which the assumptions underlying that given projection may fail, or be questionable. Particular kinds of models, particular biases in the datasets, and unusual interactions between those two.
Now suppose, for the sake of building up this argument slowly, we have one (1) input:output pair in our training set: one vector of input values x, in a tuple with a single expected value y. Given two models f and g, if I plug those values in I’ll get predictions f(x) and g(x).
Clearly, the better of these two models is the one whose prediction lies closest to the expected value. In one dimension, it doesn’t really matter to us which of the standard norms you prefer; you can use absolute difference and be intuitive about “distance in one dimension”, or square the distances if you want to [prematurely] plan ahead for higher-dimensional cases.
In this trivial case with one training point, let’s just say we pick the absolute-value norm, and we’ll stick with it as we move up the dimensions. OK?
Now suppose we have two (2) input:output pairs in our training set; two tuples of input values x_1 and x_2, with accompanying expected values y_1 and y_2. Given our two models f and g, when I plug those two sets of values in I’ll get predictions (f(x_1), f(x_2)) and (g(x_1), g(x_2)).
Let’s just dismiss Doktor Gauss for a while. Send him out of the room.
When I say that model f dominates model g under some metric, given this training data, I mean that f’s error is no worse than g’s—|f(x_i) − y_i| ≤ |g(x_i) − y_i|—for every training case in our dataset, and strictly better for at least one training case.
Notice that for any norm, the relative order of per-case errors doesn’t actually change. We’re not aggregating them over all training cases; we’re producing vectors whose components can be stretched or squared—as long as you do it independently and monotonically across training cases—without changing the domination relationship. If some model f dominates some model g given the training data, it doesn’t matter whether we square the residuals, or take their absolute values, or whatever. As long as we don’t aggregate them.
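Here’s a sketch of that domination test, with invented error vectors; the function name and data are illustrative, not any standard library API. The final pair of checks shows the invariance claimed above: squaring each (nonnegative) error independently is monotone, so it can’t flip the outcome.

```python
def dominates(errs_a, errs_b):
    """Pareto domination over per-training-case absolute errors:
    a is no worse everywhere, and strictly better somewhere."""
    no_worse = all(ea <= eb for ea, eb in zip(errs_a, errs_b))
    strictly_better = any(ea < eb for ea, eb in zip(errs_a, errs_b))
    return no_worse and strictly_better

errs_f = [0.1, 0.1, 0.2, 0.2]    # hypothetical per-case errors for f
errs_g = [0.4, 0.1, 0.3, 0.2]    # hypothetical per-case errors for g

# f dominates g, and squaring the residuals doesn't change that:
assert dominates(errs_f, errs_g)
assert dominates([e * e for e in errs_f], [e * e for e in errs_g])
```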
In other words, I’m wondering whether it’s possible to keep “fit with regards to a given training tuple” separate, and use standard multi-objective ranking methods to discriminate models that are dominated from those that are non-dominated, given a particular training set.
Multicriterion sorting yields a partial ordering, and that means there are often—especially in higher dimensions—mutually nondominated points. But that doesn’t make it alien to traditional scalar ordering, which is just the degenerate case: there are ties in races, after all.
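A minimal sketch of that multi-objective ranking, again with invented numbers: given a family of models represented by their per-training-case error vectors, filter out everything that’s dominated and keep the nondominated set.

```python
def dominates(a, b):
    """a dominates b: no worse on every case, strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def nondominated(error_vectors):
    """Indices of models not dominated by any other model in the family."""
    return [i for i, ei in enumerate(error_vectors)
            if not any(dominates(ej, ei)
                       for j, ej in enumerate(error_vectors) if j != i)]

family = [
    [0.1, 0.9],   # specialist: great on case 1, poor on case 2
    [0.9, 0.1],   # specialist the other way around
    [0.4, 0.4],   # generalist: decent on both
    [0.5, 0.5],   # dominated by the generalist
]
front = nondominated(family)
```

The two specialists and the generalist survive as mutually nondominated; only the strictly-worse fourth model drops out. Note that no aggregation over training cases happens anywhere.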
So what does holding back on aggregation do?
Well, for one thing, it brings into focus the notion of broad and wide-ranging families of models, not the scant handful many statisticians are used to working with.
It leads to some interesting possibilities in understanding the relationship between model performance over test data (generalization) and training data.
It leads to the useful notion of sheaves of multiple nondominated models, some “specialized” for modeling one training point, others specialized in modeling other data points, and a few generalists that do quite well on all of them.
It opens up some interesting questions (to me, at least) about leave-one-out validation methods. Especially about how robust rankings might be.
Finally, it seems that it opens a door for the sort of “data balancing” work Katya Vladislavleva has made such progress with… and maybe a direction through which it can be communicated to the folks on the machine learning side.