Search algorithms

Cross-​​posted to the Nudge blog

There are really three basic genetic pro­gram­ming algo­rithms I like to use out of the box. One’s stu­pid, one’s (almost) stan­dard, and one’s pretty fancy and effective.

Start with one nugget of assump­tion: Every search algo­rithm should be multi-​​objective. Period. No excep­tions. You do a single-​​objective search on any real-​​world prob­lem, you’re either lazy or a liar. The end, no excep­tions. I don’t care if you’re an Oper­a­tions Research guru or a lone stock trader look­ing to make a buck, no prob­lem is single-​​objective.

So when I say “bet­ter” in any con­text, I am talk­ing about Pareto dom­i­na­tion, or if you’re an econ­o­mist kinda girl, “rel­a­tive Pareto effi­ciency”. Say you have two alter­na­tives, A and B, and you’re decid­ing between them on the basis of two min­i­miza­tion objec­tives, w1 and w2. A is dom­i­nated by B if w1(B)<w1(A), and w2(B)<w2(A). B is dom­i­nated by A if w1(A)<w1(B), and w2(A)<w2(B). Nei­ther one dom­i­nates the other unless it is strictly bet­ter on all com­par­isons. Oth­er­wise, A and B are non­dom­i­nated by one another; they can’t be dif­fer­en­ti­ated in a mean­ing­ful way with­out say­ing which objec­tive is “more important”.

You want to sur­vive a car crash, or do you want to drive really really fast? Which is bet­ter? So, get that into your head. There is never one objec­tive. Never.

Let’s make sure we under­stand why there are always mul­ti­ple objec­tives in GP runs. You want accu­racy, right? But you also want par­si­mony. You’re evolv­ing mod­els, solu­tions to prob­lems, func­tions, struc­tures. Things can get hairy very quickly in a run­away evo­lu­tion­ary world: pro­grams can get super­sti­tious, they can inherit weird cultish quirks, like learn­ing to use “Cosine(0)” or Sqrt(Sqrt(Sqrt(Sqrt(Sqrt(8.9912)))) instead of “1”, if some for­got­ten ances­tor hap­pened to find that first and passed it on as Big Math Juju. So as God of Your Lit­tle Engi­neer­ing Prob­lem trust me: you hon­estly want to reach down out of the clouds and say, “Hey, guys, chill. Remem­ber, we use our human-​​readable voice when we model, OK? Two com­mand­ments here: accu­rate, and succinct.”

So there’s an intrin­sic rea­son right there why you want to min­i­mize model error and model com­plex­ity. At least two objec­tives. Always.

Because there’s always a less effi­cient way to solve any prob­lem with­out sac­ri­fic­ing accuracy.

Any­way, those search algorithms.

Stu­pid” is ran­dom search:

  1. Cre­ate a ran­dom pro­gram, and eval­u­ate it
  2. Cre­ate a new ran­dom pro­gram, and eval­u­ate that
  3. Throw away the one that’s dom­i­nated, or the newer one if nei­ther is dominated
  4. Go to 2

Seem so ridicu­lous it’s not worth con­sid­er­ing? Not really. It’s essen­tially the worst you could pos­si­bly do, know­ing noth­ing about your search prob­lem. So if noth­ing else, it pro­vides a use­ful base­line, as well as a nice test of your ran­dom­iza­tion meth­ods, eval­u­a­tor tim­ing, and other infra­struc­ture that almost any search algo­rithm will share with this dumb-​​as-​​a-​​sack-​​of-​​hammers approach.

Plus: Easy to explain? Damn straight. Always run ran­dom search on your prob­lem, with your rep­re­sen­ta­tion, full-​​fledged eval­u­a­tion, and data col­lec­tion sys­tem in place.

(Almost) stan­dard”? Steady-​​state Pareto tour­na­ment selec­tion works for me for a num­ber of rea­sons, at least as a first approach:

  1. Cre­ate a pop­u­la­tion of, oh I don’t know, 500 ran­dom programs
  2. Select a tour­na­ment of maybe, ummm, 10? yeah 10 pro­grams from the pop­u­la­tion, uni­formly at ran­dom, and eval­u­ate them
  3. Iden­tify the non­dom­i­nated “best” pro­grams among them, and the dom­i­nated not-​​so-​​best ones, and split them into two groups
  4. If there are no dom­i­nated ones, pick one of the “best” pool at ran­dom and demote it out of spite. If there’s only one “best” pro­gram, it’s lonely, and inbreed­ing is bad, so pro­mote one of the lesser ones out of pity.
  5. Replace all the pro­grams in the not-​​so-​​best pool with off­spring of the non­dom­i­nated “best” pool
    • Pick two par­ents, with replace­ment, from the non­dom­i­nated set
    • Use one-​​point crossover to make offspring
    • Mutate the babies slightly, here and there, to taste
  6. go to 2

OK, I con­fess. This is stan­dard in almost no ways what­so­ever. It’s a “stan­dard” toolkit for me, sure, but it’s a pretty far cry from Koza’s clas­sic lock­step population-​​based meth­ods, the kind every­body and their brother has copied whole­sale. I like it any­way: it gives rapid feed­back on progress, it’s kindof elit­ist (except for that “spite” step), it saves a lot of search time by lim­it­ing expen­sive mul­ti­ob­jec­tive com­par­isons to rel­a­tively small pools, it works with fit­ness caching. It’s not per­fect, but I already linked to the NFL Wikipedia page, so what are you, so dumb you can’t read? Nothing’s perfect.

Fancy and effec­tive? I really like the Pare­toGP method from Mark Kotanchek and Guido Smits (and their colleagues):

  1. Cre­ate an Archive of 500 ran­dom programs
  2. Cre­ate a Pop­u­la­tion of 500 ran­dom programs
  3. Cre­ate a con­tainer for the Next Gen­er­a­tion (an empty list; hold onto that)
  4. Select a tour­na­ment of 20 Archive pro­grams with uni­form prob­a­bil­ity, and eval­u­ate them, and for­get about the dom­i­nated ones. Just make a list of the non­dom­i­nated ones from the Archive tournament.
  5. Select a tour­na­ment of 20 Pop­u­la­tion pro­grams with uni­form prob­a­bil­ity, and eval­u­ate them, and for­get about the dom­i­nated ones. Just keep the sub­set of non­dom­i­nated ones from the Pop­u­la­tion tournament.
  6. For each Pop­u­la­tion tour­na­ment win­ner you got from step 5, pick an Archive tour­na­ment win­ner you kept from step 4, at ran­dom. Cross them over to cre­ate an off­spring, maybe with some muta­tion if you must.
  7. Stick the baby into the Next Gen­er­a­tion. If you have 500 pro­grams in the Next Gen­er­a­tion, replace the Pop­u­la­tion with it and go to 3. If you’ve run out of tour­na­ment win­ners before fill­ing the Next Gen­er­a­tion, just go to 4. If you choose to attack the Orc, turn to page 88.
  8. You’re going to be won­der­ing when you stop cycling the Next Gen­er­a­tions back into the Pop­u­la­tion, aren’t you? Oh, I dunno. Pick a num­ber. Say “10”. Go on, say it. There: do it ten times. Or more. Or less. What­ever; it’ll work.
  9. Aha! But then: once you’ve done your ten renewals of the Pop­u­la­tion (what Kotanchek and such call a “cas­cade”), you’re not done. Oh, no, boyo. You get your butt up there and you start the whole thing, all over again at step 2. That’s right. Another cas­cade, from another new, ran­dom Pop­u­la­tion, and another ten gen­er­a­tions or so. But—but—before you do, take the final Pop­u­la­tion, the best best best of that cas­cade you just fin­ished, and you stick it into the Archive. Add it to the pro­grams that were already there, and take the time to trim the Archive back down to 500 pro­grams. You can do a full-​​fledged 1000-​​wise Pareto tour­na­ment if you have the time, or you could do some other cun­ning selec­tion method if you want. Often I tend to do some­thing quick and dirty, like just pick­ing tour­na­ments of 20 and culling all the dom­i­nated ones until the Archive size drops back down to 500. So, any­way, you were going back to Step 2 to run another cascade.
  10. Do that until you’re bored.

Now this is a close approx­i­ma­tion of the Pare­toGP algo­rithm described in a num­ber of papers by Mark Kotanchek, Guido Smits and Katya Vladislavl­eva. It’s a broadly cus­tomiz­able frame­work, and if it seems like a lot of extra work com­pared with the “sim­pler” GP algo­rithms described above… well, that may be true. I like it very much because it exposes exploita­tion (via the Archive) and explo­ration (via the Pop­u­la­tion) as almost sep­a­ra­ble processes. You feel like things are get­ting stuck, let the Pop­u­la­tion grow a bit and do more ran­dom restart­ing and exploratory search for weird alter­na­tives; you feel like things are get­ting too wild, you do fewer, shorter cas­cades and keep a big­ger Archive, really nail down the vari­a­tions on the already-​​discovered themes you’ve col­lected there. Play around, fid­dle the knobs, see what you get.

And that’s the point, of course. There is no “best” search algo­rithm. Don’t make me link to that paper again.

You have to pay atten­tion. You have to bring to bear a vast com­bi­na­to­r­ial arma­men­tar­ium. You have to keep a seven-​​drawer tool­box in the vir­tual garage, with a hun­dred dif­fer­ent wrenches in it, and then look at your prob­lem and lis­ten to your algo­rithm purring along, and reach into a drawer with­out look­ing and find just the right tool, and give that algo­rithm a lit­tle tight­en­ing here, a lit­tle shimmy there, grind a bit off the edge, take it apart and put it together again slightly bet­ter. With grease.

See, it’s not about solv­ing your prob­lem as fast as pos­si­ble. It’s about watch­ing it improve, and hav­ing the sense to see it improv­ing. About hav­ing the tools on hand and the expe­ri­ence under your hat so you can inter­vene in a rea­son­able and effec­tive way.

So, yes: there are other ways. None are best.

16 thoughts on “Search algorithms

  1. A really cool sum­mary — one of the best blog posts I’ve ever seen about GP, to be honest.

    Some recent work by Stephen Dignum and Ric­cardo Poli shows that things like size lim­its actively speed up bloat — the aver­age pro­gram sizes grow faster with lim­its, they just stop grow­ing when they hit the limit. Do you know if anyone’s looked at the impact of the kind of pareto approaches you’re sug­gest­ing? My guess is that they wouldn’t have that affect (which would be cool), but no promises.

  2. Bah. Don’t flat­ter me. And let’s take the dis­cus­sion to the Nudge blog. Post future dis­cus­sions there, any­way; the guys would all love to chat.

    Seems to me you should talk with Guido and Katya about that ques­tion. I have a vague mem­ory of her pos­si­bly doing some real exper­i­ments with Pare­toGP size dynam­ics for her the­sis. And I have vivid mem­o­ries of Guido dis­miss­ing bloat out of hand, since it “never hap­pens if you just main­tain model com­plex­ity as a dis­tinct objec­tive.” I know they focus 99% on sym­bolic regres­sion, so they’re firmly in your the­o­rists’ space of Koza-​​style trees.

    But you know all that, of course, hav­ing spent time sit­ting and talk­ing with Kotanchek, Smits and all. So I’m not sure I fol­low your ques­tion. You ask­ing whether it works?

    I’d imag­ine the­ory is almost triv­ial: The expected off­spring size change after crossover must surely decrease if there’s been selec­tion pres­sure for smaller par­ents. Doesn’t that make sense?

    Fur­ther, as the pop­u­la­tion matures there are small, high-​​error par­ents and large, low-​​error par­ents in the pop­u­la­tion (look at any of Kotanchek’s & Smits’s graphs, say in GPTP IV). Under those cir­cum­stances, I can’t see how crossover can ever exceed a cer­tain steady-​​state limit. It’s just a mat­ter of the innate ten­dency of unequal crossover to grow big­ger code trees, off­set by the innate selec­tion for smaller ones.

    Or are you ask­ing some­thing else I’m missing?

    In any case, it’s worth a look. You wanna do it? We’ll help.

  3. Excel­lent and very inter­est­ing read. I’ll be sure to try that out when I get the opportunity.

    A few months ago I tried solv­ing some small prob­lem with what I remem­bered about GA’s and had no luck. After read­ing your arti­cle, I’ll be smarter the next time around. It really has been quite some time since I read about the subject.

    Also, good read­ing links. You’ve been added to my rss reader :)

  4. I’ve never been a fan of multi-​​objective func­tion­als. In fact, I would say they are some­what mean­ing­less. How does one deter­mine the weight­ing between any par­tic­u­lar objec­tive (usu­ally com­pletely sub­jec­tive)? This is sim­i­lar to adding apples and oranges. For a cost func­tional to make sense it must be com­posed of com­mon units.

  5. Adan,

    Mul­ti­ob­jec­tive opti­miza­tion is, indeed, sub­jec­tive. That’s the point. Any idiot who lin­earizes two objec­tives, sets up an affine weight­ing scheme that says “0.8 safety + 0.2 speed”, they’re… well, they’re an idiot. They’re step­ping on the domain of the decision-​​maker: (a) they’re pre­sup­pos­ing that the objec­tives are rec­on­cil­able (mean­ing, the solu­tion set is con­vex, not con­cave) and (b) they’re pre­sum­ing that exter­nal­i­ties not cap­tured by those apples and oranges won’t skew things in another direc­tion. Many’s the time when a customer/​decision-​​maker has said , “Give me all the opti­mal solu­tions, and I’ll pick the best one.” That’s not dumb on their part, it’s dumb on your part if you pre­ma­turely limit their choices.

    Best always to keep all objec­tives orthog­o­nal, and let the decision-​​maker take her pick.

    So, in sum: You don’t deter­mine weight­ing. Weight­ing is not your job, it’s a domain-​​specific expert’s job. Odds are any affine weight­ing or even non­lin­ear com­bi­na­tion func­tional is in prac­tice a lie or a mis­step every time it’s assigned before search­ing. Weight after you search; let the decision-​​maker con­sider all the Pareto-​​optimal solu­tions, and take their pick.

    Try it my way first, and see what clients—actual peo­ple mak­ing actual decisions—think. And maybe some of the things you’re assum­ing about mul­ti­ob­jec­tive opti­miza­tion are mis­ap­pre­hen­sions of the ter­mi­nol­ogy, the method­ol­ogy, and the intent. If a cus­tomer wants to opti­mize for both apples and oranges, then why insist they do the “sim­pler” thing and pick one, or make up some stu­pid fake thing that sup­pos­edly com­bines them? Why not just do the rel­a­tively easy math, use the cor­rect algo­rithm for the prob­lem at hand, and opti­mize for both objec­tives simultaneously?

    I have a strong feel­ing your answer will involve fic­tions like “tractabil­ity” and “stan­dard algo­rithms”. Which are, no mat­ter how impor­tant one’s depart­ment is in a field, lazy BS.

    That’s my opin­ion, of course. You’re wel­come to carry on how­ever you feel is appro­pri­ate for your cus­tomers’ needs. :)

  6. lorg,

    I wouldn’t call it an “arti­cle”. It’s a post in a ram­bling story that’s really hap­pen­ing over here. We’re build­ing a[nother] open-​​source genetic pro­gram­ming sys­tem, and hope­fully one that’s not as stu­pid as all the rest in the world, and maybe as use­ful even­tu­ally as the good ones like ECJ.

    Point of order, though: Genetic Pro­gram­ming is not just a genetic algo­rithm. Genetic Pro­gram­ming is search through a set of algo­rithms, com­plex gram­mat­i­cal struc­tures, or func­tions, not search through a para­me­ter space. The “genetic” part of the title is essen­tially a red her­ring, too: what I described above is ran­dom search, and a cou­ple of population-​​based meta­heuris­tics that are some­what kindof like GAs, but not really.

    Have a look at Ricardo, Bill and Nic’s book [a free down­load], for more infor­ma­tion on GP vs. GAs.

  7. My posi­tion remains that advo­cat­ing one way to per­form all mul­ti­ple objec­tive opti­mi­sa­tion prob­lems is a gross sim­pli­fi­ca­tion. It is bet­ter to under­stand what the deci­sion maker wants and allow that to guide our prob­lem solv­ing approach.

    From http://​www​.jstor​.org/​p​s​s​/​2​5​8​1​556:
    A Pri­ori Pref­er­ence Artic­u­la­tion: Com­bines all objec­tives into a sin­gle objec­tive through the use of an aggre­ga­tion func­tion, turn­ing the multi-​​objective prob­lem into a single-​​objective prob­lem to search.
    Pro­gres­sive Pref­er­ence Artic­u­la­tion: May have par­tial pref­er­ence infor­ma­tion, and this pref­er­ence infor­ma­tion is adjusted as the search con­tin­ues by inter­pret­ing the results of the search.
    A Pos­te­ri­ori Pref­er­ence Artic­u­la­tion: No pref­er­ence infor­ma­tion, instead the deci­sion maker is pre­sented with a set of can­di­date solu­tions (gen­er­ated by some search process) to choose from.

    If we know that the deci­sion maker wants a spe­cific trade-​​off it would be use­less to search the extent of the Pareto front (wasted com­pu­ta­tion). Sim­i­larly if we don’t know what the deci­sion maker wants it would be silly to sug­gest one spe­cific trade-​​off as being ‘optimal’.

  8. Nobody ever said any­thing about enu­mer­at­ing the entirety of the Pareto front. And did I say the decision-​​maker has no right to min­i­mize com­pu­ta­tion effort (or clock time spent) as an objective?

    My posi­tion remains” is what’s throw­ing me off; did I come across as try­ing to con­vince you of some­thing? Or are you chat­ting with Adan?

    In any case, you’re say­ing things I agree with entirely, in a tone I’m not sure I get. Yes, you’re right. On all counts.

  9. Pingback: things to look at (april 1st - april 8th) | stimulant - changing things around. . .

  10. Pingback: Short Notes: Quantitative Hedge Funds, Google App Engine, DTH, itimes « Blue Screen Of Duds

  11. Hi Bill,

    I think your com­ment about convexity/​concavity of the solu­tion set is a bit of the mark. Using MDL type of argu­ments, you can actu­ally show that the front of non-​​dominated solu­tions for ‘log-​​likelihood’ type of error func­tions is nec­es­sar­ily con­vex. I.e., you will try to minimize

    error + con­stant * complexity

    or alter­na­tively (as a like­li­hood function):

    p(D | M) * P(M)

    (where ‘con­stant’ is hid­den as an expo­nent in the size prior in P(M)).

    Of course, fix­ing the con­stant before the run is no solu­tion as you state, you need a multi-​​objective search to find the trade-​​off. How­ever, strict Pareto dom­i­nance is a bit of overkill in this sit­u­a­tion, as you can use con­vex­ity to cull the solu­tion space.

    For the sake of argu­ment, assume that we use ‘speed’ instead of ‘size’ as a mea­sure of model com­plex­ity. Apart from this being a more nat­ural mea­sure, it also allows one to high­light the convexity.

    Sup­pose one has two solu­tion, each run­ning at a dif­fer­ent speed. They also have a dif­fer­ent error (MSE). The slower solu­tion has a smaller error than the faster solu­tion (they are non-​​dominated).

    Now cre­ate a ‘com­bined model’, which selects a model with a prob­a­bil­ity for the fit­ness cases. Flip a biased coin using this prob­a­bil­ity and select the first or sec­ond model based on this out­come. This means that each solu­tion will be used ran­domly, biased by the coin.

    Then, the error/​speed trade-​​off of this model will nec­es­sar­ily lie on the line between the two orig­i­nal mod­els: it’s error is a weighted aver­age of the errors of the two orig­i­nal mod­els, while it’s speed is the weighted aver­age of the speeds of the two orig­i­nal mod­els (plus a small con­stant for toss­ing the coin. This con­stant we ignore.).

    Hence, any two non-​​dominated mod­els dom­i­nate the entire area that lies above the line spanned by the com­bined model, and the solu­tion set is con­vex. The pareto set that is usu­ally used is larger than the true solu­tion set.

    (Of course, this whole argu­ment depends on the fact that we assume that the ‘size’ of a model is com­men­su­rable with the ‘com­plex­ity’ of a model. In gen­eral, Kol­mogorov dis­agrees. The speed argu­ment how­ever holds.)

  12. Good points, Maarten. True enough when we’re talk­ing about offline search and opti­miza­tion, and when the con­straint set is rel­a­tively tractable.

    In work­ing on this project, and writ­ing these posts, I’ve come to real­ize that my real com­plaint is that most of offline opti­miza­tion is short-​​sighted. Rude to a real decision-​​maker. Inag­ile, and fre­quently maladaptive.

    A lot of my rants through the years, where I ask peo­ple why they did some­thing dumb with mul­ti­ob­jec­tive search, are not really about the con­vex­ity struc­ture of solu­tion spaces, or even about pre­ma­ture con­ver­gence to mis­lead­ing sub­sets of pseudo-​​optima.

    They’re actu­ally ques­tions about evi­dence. Prac­tices. Whether their deci­sion was informed by the actual struc­ture of the actual solu­tion space being explored, or if they just did it because it’s the way they learned in school, or because nobody com­plained before, or because it never broke in ear­lier prob­lems. Or because it’s eas­ier that way.

    And then I ask: Who told them “eas­ier” was an objec­tive of their project?

    And as it hap­pens, I’ve rarely met peo­ple who have explored the solu­tion set and map­ping to objec­tive space of real prob­lems. And fewer peo­ple who have solved prob­lems or run opti­miza­tion “online” — pay­ing atten­tion to what’s hap­pen­ing, and chang­ing their goals or definitions.

    But your point is absolutely right, and helps clar­ify my think­ing: Of course there are con­vex pro­jec­tions of high-​​dimensional mul­ti­ob­jec­tive search spaces onto lower dimen­sions. One can use an arith­metic com­bi­na­tion model in many com­mon cases, whether it’s an affine com­bi­na­tion of objec­tives or even a non­lin­ear map­ping. There are plenty of cun­ning sta­tis­ti­cal meth­ods one could use to sim­plify all kinds of fun stuff in hard prob­lems; prin­ci­ple com­po­nents analy­sis leaps to mind.

    So yes, you’re right. I’m using hyper­bole. What I’m always yelling about is some­times unimportant.

    When a tech­ni­cian, or even the decision-​​maker her­self, decides to col­lapse objec­tives with­out first mod­el­ing the objec­tive space by sam­pling or first-​​principles mod­el­ing, that’s a mis­take. They may be cor­rect in doing it, in their prob­lem, but the way they did it was a mistake.

    Not always a mis­take in terms of results. In the case of many sim­ple search space struc­tures, like the one you point out, it’s AOK.

    If sim­pli­fi­ca­tion tended to give a wrong answer, then the mythol­ogy of single-​​objective search wouldn’t have so much trac­tion. Peo­ple would know bet­ter if they encoun­tered prob­lems sooner.

    My point is that for many prac­ti­tion­ers, espe­cially “in the­ory” and in the Acad­emy, life has given them an easy ride. They’re mak­ing assump­tions based on a string of early suc­cesses. Peo­ple raised in a cul­ture of mod­els assume they can get away with this crap in real life, and I admit that bugs me. Look at Eco­nom­ics for analo­gies, if you want. Pre­ma­ture sim­pli­fi­ca­tion with­out rig­or­ous real­ity check is a mis­take in terms of project risk.

    Let’s change gears briefly. Look at soft­ware development.

    Sup­pose I’m a project man­ager. Some cow­boy solo coder tells me not to worry about the code he wrote by him­self, in the mid­dle of the night, with­out test­ing, with­out hav­ing the cus­tomer in the room at the time. With­out any unit tests. With­out a pair pro­gram­mer. Sounds like a stu­dent pro­gram­mer I used to know be.

    That’s project risk: The cus­tomer is at risk because the coder might have failed to do what they thought they were doing (“made mis­takes”). And also because the cus­tomer might have seen the inter­me­di­ate results and changed her mind about some­thing she spec­i­fied ear­lier in the project. Or might have dis­cov­ered new busi­ness cases that weren’t spec­i­fied as goals.

    And the project is at risk in future because other pro­gram­mers can­not trust code like this black box, not with­out going to the trou­ble of writ­ing their own suite of auto­mated unit tests. Worst: the rest of the world wasn’t involved in the con­ver­sa­tion at all, didn’t share the knowl­edge that this code rep­re­sents; I’ll bet even the cow­boy and the client will have to start over if they try to do sim­i­lar work again.

    That’s a bunch of risk. More risk if it’s an inter­est­ing project. All risks that the project will fail—where suc­cess is defined as deliv­er­ing what the client needs.

    And yes, I know the cow­boy coder I described is “doing it the nor­mal way.” That’s often the wrong way.

    Is what the client spec­i­fies always what she needs? Sup­pose she came in and handed a three-​​ring binder of spec­i­fi­ca­tions to a smart pro­gram­ming team. Wrote down what she wanted, all the way to the func­tion call struc­ture, walked away, and came back to get the results when the team had writ­ten all the soft­ware spec­i­fied in the stu­pid binder. No ques­tions about lan­guage, about typos in the binder, about bet­ter ways to write the code that a team might know, no insight gained by actu­ally mak­ing the project work. Just up-​​front spec­i­fi­ca­tion of the exact solu­tion she wants.

    I’ll argue that project is also at risk. For many of the same rea­sons as the first one. It’s inag­ile, and any of us who have worked on soft­ware devel­op­ment projects know there’s always a bet­ter way to solve a prob­lem than the first one you con­sider. You always adapt.

    Now. Think about any com­plex search and opti­miza­tion project. Stock trad­ing. Phar­ma­ceu­ti­cal design. Sup­ply chain man­age­ment. Trans­porta­tion sched­ul­ing. Any­thing more inter­est­ing than stick­ing integer-​​sized boxes in a back­pack so a sales­man can deliver them to the Tow­ers of Hanoi.

    Think about what you just said about mod­el­ing. About your assump­tion that we know ahead of time the rela­tion­ship between vari­ables. Think about what that implies about solu­tions’ inter­ac­tion with com­plex non­lin­ear constraints.

    Think about when we make con­tin­gent deci­sions to com­bine objec­tives or not, depend­ing on the way they inter­act with one another in a given problem.

    So for sake of the other read­ers, con­sider the design pat­tern you just spelled out: Take two sep­a­rate non­dom­i­nated solu­tions (both are ele­ments of the solu­tion set mapped to points in the objec­tive space) and cre­ate a new model that’s a sim­ple sto­chas­tic mix­ture. You’ve defined a new point in the solu­tion set—assuming I allowed you to include sto­chas­tic com­bi­na­tion in the solu­tion set’s definition—and this maps to a new point in objec­tive space.

    Your point is that for cer­tain aggre­gate error mea­sures, this new point will lie between the first two in the objec­tive space.

    If we (who­ever “we” is) are cer­tain we’ll never, ever care about any­thing but the aver­age error, never care about that lit­tle con­stant you men­tion, or the descrip­tion length of your second-​​order solu­tion, but only the num­ber of states it passes through in exe­cut­ing its algo­rithm… sure. You’re right.

    How’s that work when we ever con­sider the vari­ance of the errors, though? What if the non­dom­i­nated “sim­ple error-​​prone” model is a coin-​​flip itself? What if I didn’t let you have sto­chas­tic ele­ments when I spec­i­fied the prob­lem, if all mod­els were deter­min­is­tic? What if the data we’re test­ing your mod­els on is itself full of struc­ture, of the type which biases the log-​​likelihood errors?

    What if, what if?

    That’s really the basis of all these years of com­plaint about mul­ti­ob­jec­tive search. There are always con­tin­gen­cies. Always rea­sons to change, to improve, to adapt any model, any algo­rithm, any approach.

    Lin­ear pro­gram­ming is a stu­pid way to solve prob­lems more often than it’s the right way… but it’s right some­times. Neural nets are stu­pid more often than they’re use­ful, but some­times they’re the right way. Single-​​objective search is stu­pid more often than not. Non­adap­tive search, offline opti­miza­tion: stu­pid. Blind appli­ca­tion of any method­ol­ogy, or mak­ing assump­tions about objec­tives or con­straints or para­me­ters or rela­tions or con­stants: stupid.

    We all know this.

    So let’s go back to that soft­ware anal­ogy I was using, just for one minute. If I were run­ning a pro­gram­ming team, they would be Agile. They’d be using XP or Scrum or some­thing to address that risk I men­tioned. There would be a cus­tomer on the team. There would be tests. There would be objec­tive, com­mu­nica­tive tests of all design decisions.

    So: Write me a test that will detect the con­di­tions under which a map­ping from solu­tion set to objec­tive space can be assumed, with high prob­a­bil­ity, to be con­vex. Alter­nately, and more use­fully, write me a test that will detect a fail­ure in that assump­tion. Some­thing I can attach to any search method I choose, as a mon­i­tor or a dec­o­ra­tor, and see when things are get­ting out of hand. A diag­nos­tic. You can do it sta­tis­ti­cally, or ana­lyt­i­cally. I don’t care.

    When a decision-​​maker can ask me to reduce the num­ber of objec­tives for her prob­lem, and we can run some tests and she can decide with con­fi­dence whether it’s safe or not, then I’ll be happy. When I am in the mid­dle of a project and we’ve decided to col­lapse some objec­tives and a lit­tle alarm goes off say­ing “FAIL: con­vex­ity assump­tion vio­lated”, then I’ll be happy.

    If I have to remem­ber to check all the time… well, that’s stu­pid. Just as stu­pid as if I had started using Lin­ear Pro­gram­ming at the begin­ning with­out check­ing to see whether the prob­lem was well suited to it, and spent hours run­ning LP algo­rithms with no way of know­ing if the solu­tions were converging.

    See: you and I, and Nic and Ric­cardo and Koza and my friends on the Nudge project and all the rest, we’re in worse shape than almost every other search geek. Think about it. Go back to that soft­ware project anal­ogy. Not only are we try­ing to model com­pli­cated prob­lems at the same time we’re search­ing through them—in the anal­ogy, spec­ify and refine busi­ness deliv­er­ables at the same time they’re being devel­oped. We’re using genetic fuck­ing pro­gram­ming.

    We get solu­tions, but we have to fig­ure out how the solu­tions work. Think about it. A mil­lion cow­boy pro­gram­mers writ­ing a mil­lion black boxes, all try­ing to write pro­grams that we spec­ify up front?

    Or are we bet­ter off if we watch what search is mak­ing them dis­cover, and reach in and inter­act, and drive solu­tions toward what we mutu­ally dis­cover are not only bet­ter solu­tions to what we spec­i­fied ini­tially, but also better-​​specified?

    How many times have we seen a GP search spit out an answer that opti­mizes every­thing we asked for… but which nonethe­less “cheats”? Fails because of some­thing we didn’t spell out.

    I pre­fer dia­log. Agility. Con­stant inter­ac­tiv­ity. On-​​line search and opti­miza­tion, only, all the way down. Because I don’t even trust myself to spec­ify or solve inter­est­ing prob­lems. I don’t trust myself to know what I want, or how to describe it, or to know before­hand how my toolkit will inter­act with it. I don’t even trust myself to stick to the plan. I don’t trust myself programming.

    If I don’t even trust myself, even though I so obvi­ously think highly of myself, con­sider how much I trust pro­gram­mers writ­ing pro­grams for me, or run­ning an opti­miza­tion for me, or opti­miz­ers solv­ing other com­plex prob­lems for me.

    Espe­cially when the “pro­gram­mer” or the opti­mizer isn’t even human.

    So that’s why I con­tinue advis­ing peo­ple never to sim­plify pre­ma­turely. I think I’d advise them never to do any­thing algo­rith­mic at all unless they’ve got fine-​​grained auto­mated tests in place to cover the details, and reli­able accep­tance tests in place to inter­ro­gate prospec­tive “solutions”.

    But I don’t know what those tests are, yet. You sketched one, though you didn’t present a reli­able auto­mated ver­sion yet. Until then, every time we step ahead with­out those explicit tests in place—so we know when some­thing com­pli­cated happens—our project is at greater risk of failure.

    And com­bin­ing objec­tives is exactly the sort of step I mean.

  13. Wow Bill, a post wor­thy of a blog entry in its own right. But this thread is get­ting inter­est­ing, so I’m not let­ting it die yet.

    First of all, I want to re-​​emphasize that I’m not advo­cat­ing lin­early com­bin­ing objec­tives in order to reduce the objec­tive space. I am sim­ply say­ing that usu­ally, one is inter­ested in the con­vex hull of the Pareto set, not the entire set. This is still a large set, and the set needs to be dis­cov­ered. This may sound like a dan­ger­ous sim­pli­fi­ca­tion to you, but I think it is war­ranted, espe­cially in the error /​ size trade-​​off I sketched.

    Multi-​​objective prob­lems are multi-​​dimensional, and pre­ma­turely reduc­ing them to a sin­gle objec­tive will lead to poor results. I com­pletely agree with you that you want the algo­rithm to search out the alter­na­tives, and let the deci­sion maker pick the best (com­bi­na­tion of) solu­tions *after* the set has been produced.

    If you present it to the ‘cus­tomer’ in this way, even with a full Pareto based approach, you will notice that the deci­sion maker will always choose a solu­tion on the con­vex hull, never one in a con­cave bend. (Yes, also I can be very con­fi­dent in my asser­tions with­out any evi­dence to back me up)

    Yes, these con­cave bends are very inter­est­ing, almost as inter­est­ing as the first front of *dom­i­nated* solu­tions. That is, poten­tially impor­tant, but given that we’ve got a full con­vex hull to go through for analy­sis, they can wait.

    I think we’re on the same page gen­er­ally, as also I have stared and stared at what this genetic pro­gram­ming is pro­duc­ing. I vio­lently agree that one needs to exam­ine and re-​​examine every­thing that is pro­duced, try out dif­fer­ent things. I love to inspect the evo­lu­tion­ary trees that pro­duce a solu­tion (I believe that Nic is one of the few who has that same hobby). I have got­ten great result by mix­ing objec­tives. I’m just inter­ested to make my life a bit eas­ier, by reduc­ing the solu­tion set to this con­vex hull. Why?

    I’m into this sym­bolic regres­sion thing. From the way I view it, each and every fit­ness case has an unknown asso­ci­ated with it, the noise of that point (even when assum­ing Gauss­ian noise). Cal­cu­lat­ing some­thing triv­ial as the MSE is already cheat­ing, as you put the assump­tion in that every point has exactly the same noise. You’re reduc­ing the objec­tive space from all points to an unweighted aver­age. We don’t know if this is the case, it’s an assumption.

    So now, if I have 100 dat­a­points, I’m the­o­ret­i­cally con­fronted with at least 101 objec­tives: one objec­tive for each dat­a­point, and one (or more) objec­tives for the suc­cinct­ness of the solu­tion. For starters, each con­stant that is a tar­get value is part of the con­vex hull, and thus part of the Pareto set. Now, if I were to do a full Pareto dom­i­nance opti­miza­tion in this 101 dimen­sional space, this would become a night­mare. Not only does the size of the solu­tion set becomes intractable (also the con­vex hull would be huge), the algo­rithm itself would come to a screech­ing halt sim­ply because of the sheer com­plex­ity of cal­cu­lat­ing the dom­i­nance rela­tion. Con­vex­ity can how­ever be used to at cull the space a bit, and can be used for quicker algo­rithms to deter­mine if the point is on the hull or not. I would also not want to waste my time exam­in­ing these con­cave bends when there are more promis­ing points.

    No, I’m cur­rently not tack­ling prob­lems this way (although I would like to), but just mak­ing the point that with a grow­ing num­ber of objec­tives, uti­liz­ing con­vex­ity is going to be a help, both in get­ting the algo­rithms to per­form, and in culling the solu­tion set to the most inter­est­ing points.

    Also given the strong the­o­ret­i­cal sup­port for con­vex hulls as objects of inter­ests next to Pareto sets, I’m amazed for years now that the Multi-​​Objective com­mu­nity in EC has com­pletely ignored it, and is fully focused on Pareto sets alone (with addi­tional assump­tions (eta-​​dominance) to man­age more-​​than-​​two objec­tive prob­lems). Pareto sets are more dif­fi­cult to find and to man­age than con­vex hulls, and the addi­tional ben­e­fit of these con­cave bends that they can induce is at best mar­ginal. You would find them by exam­in­ing the sec­ond con­vex hull.

    As for assump­tions, the addi­tional assump­tion of con­vex­ity over Pareto dom­i­nance is that you can indeed sto­chas­ti­cally mix solu­tions.
    As far as assump­tion go, this is quite a bet­ter assump­tion than assum­ing that the noise is dis­trib­uted equally over all your dat­a­points. Now that’s a dan­ger­ous one! ROC curves are also based on the prin­ci­ple of sto­chas­tic mix­ing. It’s sound, it works in many case, it’s what your deci­sion maker wants.

    The con­vex­ity assump­tion usu­ally works (as a first-​​order approx­i­ma­tion) for error mea­sures (actu­ally it only works exactly for absolute error, the line between two points has a lit­tle bend based on the covari­ance of the two solu­tions for squared error.), and for speed it is quite correct.

    For size it would be incor­rect, but given that size itself is already a hope­less mea­sure of com­plex­ity (the Kol­mogorov kind), also Pareto con­cav­ity will eas­ily lead one astray.

    So, what am I say­ing here? Pos­si­bly: bet­ter to add a few more objec­tives and keep things man­age­able by mainly con­sid­er­ing the con­vex hull, than using fewer objec­tives so that you get this con­cave
    stuff as well.

    And now I’ve got to quit. Need to get some sleep. Got a scrum first thing tomor­row morning.

  14. Hello,

    I was won­der­ing if there is an imple­men­ta­tion of Pare­toGP ? and any ideas of its per­for­mance com­pared to oth­ers such as SPEA2,NGSA-II

    thank you!

  15. Well, the vast major­ity of that went over my head.

    I just picked through some genetic algo­rithms, and I’m notic­ing some inter­est­ing trends.

    - They tend to be good at iden­ti­fy­ing *roughly* where a local min­i­mum is located in a space, but become ter­ri­bly inef­fi­cient at pro­gress­ing from there.

    - Muta­tion and crossover rates need to be dynam­i­cally bal­anced based on the vari­ance, espe­cially in dynamic sets, to pre­vent pre­ma­ture homog­e­niza­tion of the pop­u­la­tion. There shouldn’t be a trig­ger for hyper-​​mutation, it should be dynamic.

    - Non-​​mutated can­di­dates are worth­less except to add weight to the suc­cess of a trait. These can­di­dates tend to be heav­ier than sim­ply track­ing a data point which can be used to deter­mine ‘viril­ity’ of a trait.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>