Items of some interest…

These are my recent Pin​board​.in links:

  • [1111.1797] Analy­sis of Thomp­son Sam­pling for the multi-​​armed ban­dit problem

    e multi-​​armed ban­dit prob­lem is a pop­u­lar model for study­ing exploration/​exploitation trade-​​off in sequen­tial deci­sion prob­lems. Many algo­rithms are now avail­able for this well-​​studied prob­lem. One of the ear­li­est algo­rithms, given by W. R. Thomp­son, dates back to 1933. This algo­rithm, referred to as Thomp­son Sam­pling, is a nat­ural Bayesian algo­rithm. The basic idea is to choose an arm to play accord­ing to its prob­a­bil­ity of being the best arm. Thomp­son Sam­pling algo­rithm has exper­i­men­tally been shown to be close to opti­mal. In addi­tion, it is effi­cient to imple­ment and exhibits sev­eral desir­able prop­er­ties such as small regret for delayed feed­back. How­ever, the­o­ret­i­cal under­stand­ing of this algo­rithm was quite lim­ited. In this paper, for the first time, we show that Thomp­son Sam­pling algo­rithm achieves log­a­rith­mic expected regret for the multi-​​armed ban­dit prob­lem. More pre­cisely, for the two-​​armed ban­dit prob­lem, the expected regret in time $T$ is $O(frac{ln T}{Delta} + frac{1}{Delta^3})$. And, for the $N$-armed ban­dit prob­lem, the expected regret in time $T$ is $O([(sum_{i=2}^N frac{1}{Delta_i^2})^2] ln T)$. Our bounds are opti­mal but for the depen­dence on $Delta_​i$ and the con­stant fac­tors in big-​​Oh.

    probability-​​theory machine-​​learning exploitation-​​exploration nudge-​​targets game-​​theory

Comments are closed.