I spend a lot of time working in a field other people call “Genetic Programming”, though I prefer to call what I do “not relying quite so much on received wisdom and dogmatic habits regarding data and modeling.” “Genetic Programming” is admittedly shorter; then again, every real project has multiple objectives.
And that’s a nerdy in-joke right there. It’s about multiple-objective (or multi-criterion) search. I’ll tell you about it another day. Every time I make a nerdy joke, or you hear a nerdy joke, that you don’t understand, you write it down and I’ll tell you what it means. Then you memorize it and soon you’ll have a decent armamentarium of nerdy jokes. Oh, the adventure of molding little new nerds!
So anyway, the distinction between the names we use is not specious. Very Smart People do Genetic Programming. I am stupid, and also a fool, and forgetful, and have some kind of amnesiac fault in my head that makes it so whenever I do an experiment I somehow forget all the stuff that normal people are able to assume is true about the system and their tools and their habits and their data.
This is a serious handicap. It means I am forced to constantly peek at things everybody else is able to take for granted. I plot charts of measures that nobody else bothers to plot; I write tests for code that everybody knows will just work; I refuse to trust other people’s libraries if they don’t have tests because me, I am so dumb; I emit a series of plaintive “Why?” noises whenever I read other people’s preprints or published papers. And lord help the speaker at any conference I attend, me with my “two questions” I always ask after every goddamned one of them when all they really need is a drink and a chair.
Poor Bill. You’ll have to forgive him—he has a condition.
To the point: In the field of Genetic Algorithms, and by (specious) descent in Genetic Programming and more generally in Machine Learning, the Very Smart People have some habits I’m unable to save in my faulty long-term habit cache.
One of the most common is The Plot of the Best So Far.
In Machine Learning projects (including Genetic Algorithms and Genetic Programming ones), what you often have when you start is just a bunch of examples of something. Then you run a program that takes those examples as inputs and chugs away and finally poops out a model of the structure:function relations to be found within those examples. The algorithm does magic thinking stuff, and it knows what you want, and it looks at the data for you. For cultural reasons Machine Learning (and GA/GP) folks talk about this process as “search” rather than “guessing a lot of times”, but it is nonetheless a fact that most algorithms really are just generating a series of increasingly biased guesses.
With few exceptions, Machine Learning (and GA/GP) algorithms start with a couple of dumb guesses, then “look” at those and try to “learn” how to make more cunning guesses over time. They guess and guess and guess, until eventually they either start being boring, or they “prove” in some very complicated but authoritative way that the final guess is the absolute best that could ever be found.
As a side effect of this iterative process, they produce not only a series of guesses, but of necessity those guesses are associated with a series of scores. No matter what they’re up to or when they quit, Machine Learning algorithms are pretty much all about moving away from the bad-scoring initial dumb guesses, and towards the better-scoring sort of well-considered and elegant and refined models.
Bear with me. Because of my persistent troubles, I have to remind myself of this.
So anyway, this is how Machine Learning works: Very Smart People who don’t have my problem are handed a big pile of examples collected by a “domain expert”, and they know already what it is they want to do for models. (I feel like a blind man explaining color when I say they “know already”.) And then they just download and run and report the elegant and refined models they get at the end of running a “search”. I mean they start a computer running, and move the laptop off their legs so their thighs don’t burn while it’s guessing, and when the guessing is done they will have a very nice model of the examples, that has a good score. This very nice model is handed to the domain expert whose examples were dumped into the hopper. In my experience, the domain expert is typically very happy to have something more explanatory than the huge pile of virtual post-it notes and dumb guesses they started with.
Very Smart People who write Machine Learning algorithms go through a bit more effort. For them, the motivation for writing new Machine Learning algorithms is to show that the series of guesses produced by an old, obsolete algorithm any fool can download as an R library won’t get to the elegant and refined scores as fast as the new one they thought up. So Very Smart People who write Machine Learning algorithms like to run a lot of horse races.
What’s a horse race? Well, first you need a pile of examples from a domain expert. Then you take Bad Old Machine Learning Algorithm Number One, and you start it running, and instead of walking away and waiting for it to be done, you write down the list of scores for the guesses it produces along the way. And then you do the same for Excellent New Machine Learning Algorithm Which You Wrote. And you show that ENMLAWYW is consistently better at reaching the best-sounding scores, or gets better faster, or more often gets better, or maybe has more something else than BOMLANO. And then you publish a paper!
Sucky old BOMLANO. Nobody uses it any more.
Now it’s fine to make the claim; that will almost always get you where you want to be. But in papers, it’s important to show and tell. So you need to make a drawing of some sort, the kind that fits into a two-column layout in a four-page paper without getting too squished. Alas, many Machine Learning algorithms (even that crap BOMLANO) produce a huge number of scores as they guess towards success, and plotting those is messy and confusing.
So in your time-series plots of the horse race, you show the Best So Far. For several reasons: Plotting 50000 points or so is messy and confusing, especially in a 5-cm square. To be honest most people don’t really care about your thing or what you claim to do; they just want to know you algorithm isn’t worse than BOMLANO (which they learned in graduate school so it has a special place in their habit caches), and so for them best at the end is important—hell, if they’re domain experts, they only want one model after all. But best at the end is a tricky thing, and really you want to show one of those other things I mentioned: ENMLAWYW gets better faster than BOMLANO, or ENMLAWYW gets higher scores than BOMLANO, or ENMLAWYW isn’t as good as often as BOMLANO.
Notice that nobody really cares, beyond a few cranks like me, about anything but the One Final Answer a Machine Learning algorithm provides. Some rare advanced engineering professors might sometimes teach a seminar in multi-objective search, which (I said I’d explain) usually return a whole bunch of high-scoring things that are different from one another but “tied” in the sense the domain expert cares about. But even that is the final answer: it’s just a lot of them.
It’s stupid to care about what happens before an algorithm is done. Everybody knows that stuff.
Woe for me. I’m a sick and confused man because of my habit problem. It makes me look. Not only when I write a new algorithm, but whenever I run one. Not only at the best scores so far, but at every score of every guess. Because I just fail every damned time at assuming I know what’s happening in there.
And every time I look, I am confused by what I see. All kinds of stuff, usually.
Here’s an example from yesterday.
I’ve been looking into some bioinformatics data from Jason Moore’s lab. The pile of examples here are SNP genotype data collected from sick and healthy people, and the models one wants are supposed to “explain” the relation between those folks’ genomes and disease state. I’m initially representing the models using a little domain-specific language I copied from some software they use, though I called the models “Snip scripts” for reasons that aren’t important except to explain the plot title.
Now a fully able Genetic Programming practitioner (and surely most Machine Learning ones too) would be able to look at that problem, and say “Aha!” or whatever it is they say, and boot up R and make it go and not burn their knees, and they’d print the best model out on some kind of dot matrix printer (I suppose for dramatic effect) and tear it off and hand it to Jason. And that would do.
But here’s what I do. I worry about how much better one kind of guessing (call it “dumb”) is, compared to another kind of guessing (call it “machine learning”). So I have to check and see.
Here’s what you get, score-wise, when you make 100000 guesses. That is, in this dopey handicapped algorithm, I picked 100000 random Snip scripts, scored them according to Jason’s instructions, and plotted them as a time-series of eeny weeny little Xs. The scores range (in theory) between 0.5 and 1.0. (Don’t ask about the extra space on the y axis, OK? It’s for something else.)
This is just the sort of thing I like to see. I’m so skittish, I need the reassurance that dumb guessing won’t work on a complicated problem like this. And I also like to have some kind of “compared to what?” distribution to compare against, when I do fancy machine learning things, and there you got one. The distribution is: pretty much somewhere between 0.5 and 0.6 is all you’re gonna get. There’s also a little bit of structure in there, like maybe some kind of horizontal striping, but that might just be rounding. Nothing fretful at all.
But I’m still a skittish fellow with a habit problem, and like I said I don’t really “do” Genetic Programming so much as poke around in the world of data and models. Genetic Programming as such is actually kind of complicated and Microsoft Wordish, to be honest: over the years Very Smart People who write new algorithms have thrown a lot of junk in there, in the spirit of “biological inspiration” or something like that. They wrote a lot of papers, so they’re pretty good, whereas I’ve written damn all for papers, so clearly I need to take it slow.
Now Genetic Programming is as a King among Metaheuristics, and Dumb Guessing is a lowly ant in comparison. Being skittish, I decide yesterday to make a little step and see what happens when we approach the noble Genetic Programming. So I try hill-climbing.
Now hill-climbing is a time-honored straw man of a Machine Learning algorithm. If you can’t figure out the algorithm from the name, let me tell it to you right now: make a random guess, then make another random guess and keep the new one if it’s at least as good as the first one. If you want to see dumb hill-climbing in action, look at the plot I’ve already posted, and imagine that instead of throwing every guess away when I make a new one, I kept the best so far.
Him again.
If you peer at the guessing plot, you can see I could maybe have accidentally found a best-so-far score of about 0.6 after 20000 guesses or so, and pretty much not improved that. Dumb hill-climbing is pretty dumb.
So yesterday I think I can check off “dumb hill-climbing” as completely entailed by my “dumb guessing” plot, and I have to ask: What’s my next step towards King Genetic Programming?
That would be “not-so-dumb hill-climbing”. Basically the only difference between “dumb” and “not-so-dumb” is the way I generate new guesses (and that phrase is in itself a deep lesson in Machine Learning you should really take to heart): instead of replacing the entire guess, I’ll use what I already “know” to make an “informed” guess.
In this case, what I did was take these Snip scripts, which are strings of tokens, and replace some of the tokens with new randomly-selected ones. For reasons I don’t need to explain here, any string of Snip language tokens is a valid model, whether or not it’s a good or bad scoring one.
So I’ve got my next tentative step in metaheuristic space towards Genetic Programming (all kneel), which is this: I make a dumb guess, and I score it, and then I make a second one where I change some proportion p of the old script’s tokens to new ones, and I score that, and I keep the new one if its score is no worse than the old one’s.
Ah, but, but… what proportion p should I use? Crap, I have no damned idea—damn this broken brain! If the number is too high, I’m changing every token so it’s back to dumb guessing; if the number is too low, I’m not changing hardly any tokens, and it’s like even dumber guessing where I keep looking at the same guess over and over. Ummm, somewhere in between?
Now there are Very Smart People who have explored this at length, but like I said I find I can’t read their papers without emitting a series of “Why?!” noises, so to avoid waking up the dog from his nap I just tried a bunch. Like, all of them. I just started with a high number (0.5), and cut it in half every once in a while until it got down to pretty much nothing, and then popped it back up to 0.5 again.
Do I know what’s happening? I do not.
So the I look at some runs. Here’s one. Ignore the little tracy lines on the bottom half, which I won’t bore you by explaining.
The eeny weeny “X” marks are once again the scores of every new guess I make. From left to right I’m making a guess, and then making a new guess based on that, and keeping the new guess if it’s at least as high-scoring as the old one, or the old one otherwise. And because I don’t know what mutation rate (which is technically controlling “uniform mutation”, meaning p indicates the chance that I replace any given token in the old script with a new random token) I start it ridiculously high, and gradually drop it to 0.0, and then pop it back up again to a high number. That’s the “oscillation” you can see over time, even though I don’t trace the mutation rate itself on this plot.
You can see when the mutation rate is high, because the distribution of “new mutant guesses” looks an awful lot like the distribution we see in the “dumb guesses” plot we already talked about. And you can see when it’s small, because the distribution of new guesses is pretty much “even dumber” by picking the same damned script over and over—the best I’ve seen so far—so the little Xs cluster up near the plateaus.
But in between, stuff happens.
Now if I weren’t a handicapped fellow, I would have been content to plot the best scores so far. I’d watch the trace pop up to 0.6 or so around 5000 guesses, and I’d be like “hey!”. I’d grin and watch the best score so far pop up again to 0.7! around 10000 guesses, and I’d be like “whoa!”. And I’d see it hop all the way up to around 0.8 and I’d be like “omg!” and I’d print it out on the Okidata and tear it off with a flourish and hand it to the domain expert with a little dusting off of my hands to indicate how well I know it’s a job well done.
But I accidentally plotted every score, not just the best ones. Well, no more accidentally than I accidentally wrote this long-winded exposition: it’s what I am forced to do by my faults. It’s not a choice: I was born this way.
As a result, I am given some puzzles. Not just puzzles, because there are the things I’ve already pointed out to you, like: I can infer something from this plot of everything that you would never see if you just looked at the maximum score over time. I can explain something about the effect of mutation rate on the ability of this guessing process to find improvements.
But there are puzzles, too.
Do you notice, like I do, that there are an awful lot of different high-scoring X marks, ones that are well outside the expected distribution of scores reached by “dumb guessing”? And do you also see that they only appear after a new best score has appeared?
Do you see the horizontal banding, over time, down in the middle of all those scores? Remember that those are still random variations of the “current best”. Somehow random variations tend to be extremely biased, score-wise, and produce quite similar functions even though their structures must be substantially different.
Do you see something qualitatively change, around 25000 guesses? There’s something different about the way the scores are clumped up around the best so far, and in a few other clumps somewhat below that plateau. There’s also something subtle about the way the plateau itself is composed, as though there were a lot of little incremental improvements over the last 20000 guesses.
Neither of us would have seen those details, if I’d plotted “best score so far”. If I were a real, non-handicapped Machine Learning practitioner (note I don’t say Very Smart Person, because that’d be pretentious), that would be fine: my domain expert customers would be interested in the one best-scoring answer at the end, not that wobbly stuff in the middle; my paper-reading customers would be interested in the time-course of best-scoring answers compared to the well-known BOMLANO benchmark, not all that not-quite-so-good junk that muddies the issue.
Hell, I only showed you one of five runs I did. The others look different. Did I do those other runs Because Stochasticity, and I want to be extra-fancy and report a variance? To make the case for my pronouncement of the Best Model Evar?
No chance; that’s the Very Smart way to work. I did five because I don’t even trust that one example is enough to show me everything I might stumble over. I’m a skittish fumble-thumbs when it comes to this stuff, and I want to know what actually is going to happen before I stride on into using Canned Algorithms (aka, “where angels fear to tread).
And I burned my knees, too, by the way. Seriously. This is a “notebook computer”, not a “laptop”, and technically its CPU is not supposed to reach its 180°C sitting on your thighs.
So here’s the interesting thing, to me: What’s hidden away in all that not-quite-so-good stuff? That is, what are the almost-as-good models, and what can I learn from them about the system this data represents?
Very Smart People have written, authoritatively, that the strength of modern Machine Learning algorithms is the degree to which they tacitly embody that knowledge as they efficiently capture information in the examples. Genetic Programming uses the Power of Biological Inspiration; other Machine Learning algorithms use other powers, like the Powers of Definite Convergence and Ergodic Coverage and Gradient Descent to capture the same stuff. They’ve shown this through mathematics (based on only a very few assumptions), and through experimentation (based on a huge number of benchmarks). They trust one another. This stuff is built right into R libraries and Mathematica; it’s no longer subject to question, honestly.
I’m not that smart. So I have to look. And as result I get distracted: Isn’t it interesting that there is so much hidden in that unexplored detail? Who but the broken shall attend to the otherwise unremarked?
This is the sort of thing you’ll get told does not substantively advance the field. You can’t write a paper about it, or even give a talk at a workshop. The field has moved on, way far past this sort of poking, on to other much Smarter things.
Which is one reason why I incessantly claim to have no field. Me? I just answer questions for people, and we explore together. I don’t have the right to tell them stuff.
Also, I have misplaced the dramatic dot matrix printer.
[
[