<h2 id="techniques-for-bandit-problems">Techniques for bandit problems</h2>
<p><em>2022-07-05</em></p>
<p>As the field of bandits evolves, tons of papers show up every year. Rather than reading them one by one, this post aims to give the most general intuition for how to think about this type of problem, along with the three most famous benchmark algorithms.</p>
<p>We go back to the lottery problem described in the last post. The reason we need to think about strategies is that we are <strong>uncertain</strong> which machine gives the best winning rate. If we knew which one was the best, we would just play that one and not worry about best-arm identification, regret minimization, and so forth. Uncertainty is the key issue we would like to overcome. In some sense, we are making inferences about the underlying, unknown characteristics of each machine from our observations.
How would we measure “uncertainty”? In the lottery example, it is measured by the variance. If we let $p_1$, $p_2$, $p_3$ be the winning probabilities of machines 1, 2, and 3, then the variance measures how much each observation deviates from those true probabilities. In order to find the best machine, we need to reduce the uncertainty as quickly as possible.</p>
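To make this concrete, here is a minimal sketch (the function name and the 50%-winning machine are made up for illustration) of how the uncertainty about one machine's winning rate shrinks as you play it more times:

```python
import random

def estimate_win_rate(pull, n):
    """Play one machine n times; return the estimated winning probability
    and its standard error, which shrinks like 1/sqrt(n)."""
    wins = sum(pull() for _ in range(n))
    p_hat = wins / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5  # plug-in estimate of sqrt(p(1-p)/n)
    return p_hat, se

# a hypothetical machine that wins 50% of the time
machine = lambda: 1 if random.random() < 0.5 else 0
```

Playing the machine 10,000 times instead of 100 shrinks the standard error by a factor of 10, which is why more plays of a single machine mean less uncertainty about it.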
<p>How to minimize uncertainty? On one hand, if you play one machine more and more times, you will get a better sense of how good the machine is. On the other hand, if you play more and more machines, you will have a better chance of playing the best machine. Therefore, there is a natural tradeoff between playing more machines and playing one machine more times. This tradeoff is called the balance between <strong>exploration</strong> and <strong>exploitation</strong>. All bandit algorithms are trying to find the best balance.</p>
<p>In the bandit literature, there are three famous benchmark algorithms for these purposes: two for regret minimization, UCB and Thompson sampling, and one for best-arm identification, elimination. In what follows, we introduce the UCB and elimination algorithms and the intuition behind them.</p>
<h3 id="elimination-algorithm">Elimination algorithm</h3>
<p>The elimination algorithm is used for finding the best arm, or in the lottery example, the best machine. The intuition behind this algorithm is that when you play each machine enough times, the uncertainty decreases. You get a sense that certain machines are probably not going to be the best, since they performed so badly in the times you played them. Therefore, you only need to keep playing those machines which could still be the best. More formally, the algorithm proceeds in rounds. In round $\ell$, it eliminates all arms whose empirical gap to the best arm is greater than roughly $2^{-\ell}$. The framework of the algorithm is as follows:</p>
<p><img src="/images/elimination.png" alt="" /></p>
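As a rough sketch of this idea (not the exact algorithm in the figure; the function name, the Hoeffding-style confidence radius, and the stopping rule are choices made for illustration), a successive-elimination loop might look like:

```python
import math
import random

def successive_elimination(arms, delta=0.05, max_rounds=2000):
    """Play every surviving arm once per round, then drop arms whose
    empirical mean is confidently below the best empirical mean.
    `arms` is a list of pull() callables returning rewards in [0, 1]."""
    active = list(range(len(arms)))
    sums = [0.0] * len(arms)
    pulls = [0] * len(arms)
    for r in range(1, max_rounds + 1):
        for i in active:
            sums[i] += arms[i]()
            pulls[i] += 1
        # Hoeffding-style confidence radius after pulls[i] plays
        means = {i: sums[i] / pulls[i] for i in active}
        radius = {i: math.sqrt(math.log(4 * r * r / delta) / (2 * pulls[i]))
                  for i in active}
        # keep only arms whose upper bound beats the best lower bound
        best_lower = max(means[i] - radius[i] for i in active)
        active = [i for i in active if means[i] + radius[i] >= best_lower]
        if len(active) == 1:
            break
    return active
```

Arms with a large gap to the best one get eliminated quickly, so most of the later plays go to the machines that could still plausibly be the best.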
<h3 id="ucb-algorithm">UCB algorithm</h3>
<p>The UCB algorithm is used for minimizing the regret, in other words, losing the least amount of money over the 100 runs. The principle behind it is “optimism in the face of uncertainty”. In the lottery example, after you play a machine a number of times, you can construct a confidence interval for its true winning probability. The UCB algorithm, in each round, plays the arm whose upper confidence bound is the largest. In other words, in each round it plays the machine that could potentially have the largest winning probability given the current information. The framework of the algorithm is as follows:</p>
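A minimal sketch of this idea, using the classic UCB1 exploration bonus (the function name and the particular bonus $\sqrt{2\log t / n}$ are one standard choice, not necessarily the exact variant in the figure below):

```python
import math
import random

def ucb(arms, horizon):
    """UCB1 sketch: play each arm once, then always play the arm with the
    largest upper confidence bound on its mean reward."""
    n = len(arms)
    sums = [0.0] * n
    pulls = [0] * n
    history = []
    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1  # initialization: play every arm once
        else:
            # empirical mean + exploration bonus, largest wins
            i = max(range(n), key=lambda j: sums[j] / pulls[j]
                    + math.sqrt(2 * math.log(t) / pulls[j]))
        sums[i] += arms[i]()
        pulls[i] += 1
        history.append(i)
    return pulls, history
```

The bonus term shrinks as an arm is played more, so under-explored arms keep getting optimistic chances, while in the long run the empirically best arm receives the vast majority of the plays.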
<p><img src="/images/ucb.png" alt="" /></p>

<h2 id="a-general-introduction-to-multi-armed-bandits">A general introduction to multi-armed bandits</h2>
<p><em>2022-06-20</em></p>
<p>Suppose you are in Vegas facing three lottery machines, each with a different probability of winning a prize:</p>
<p><img src="/images/lotterys.png" alt="" /></p>
<p>You would like to figure out which one wins the most, so you try out these machines:</p>
<p><img src="/images/lotteryresult.png" alt="" /></p>
<p>After trying them out many times, you start thinking about strategies:</p>
<p><img src="/images/ideas.png" alt="" /></p>
<p>What would be the answer to these questions?</p>
<p>Congratulations, you have entered the field of multi-armed bandits!</p>
<h3 id="best-arm-identification-vs-regret-minimization">Best-arm identification vs. regret minimization</h3>
<p>A <em>multi-armed bandit</em> is a machine with a sequence of slots (rounds of play), where in each slot you choose among multiple options (arms) and then observe an outcome. The two ideas above actually correspond to two branches of multi-armed bandit problems. The first corresponds to what’s called “best-arm identification” problems. In this setting, people care about finding the “best one”, and they are usually interested in the question “how many plays do I need to find the best one?”. In the bandit community, this number is called the <em>sample complexity</em>. The second corresponds to what’s called “regret minimization” problems. In this setting, people care about the process, and they hope to lose the least amount of money during each play of the machine. The amount of money they lose compared to always playing the best one is called the <em>regret</em>.</p>
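The regret just described can be made concrete with a tiny sketch (the function name and the winning rates are made up; this computes the <em>expected</em> regret of a sequence of plays):

```python
def pseudo_regret(means, plays):
    """Expected regret of a sequence of arm choices: how much worse the
    chosen arms' mean rewards are than always playing the best arm."""
    best = max(means)
    return sum(best - means[i] for i in plays)

# e.g. three machines with winning rates 0.2, 0.5, 0.8:
# playing arms [2, 2, 1, 0] forgoes 0 + 0 + 0.3 + 0.6 = 0.9 in expectation
```

A strategy that always plays the best arm has zero regret; every play of a worse arm adds that arm's gap to the total.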
<p>The second idea is natural to think about, since finding the best machine might actually cost a lot. Intuitively, to find the best one, you need to play each machine many times to get a sense of how good it is. This is fine when you happen to be playing the best machine, but playing the not-so-good machines many times will make you lose a lot of money. On the other hand, there are strategies that lose less money even without ever knowing which machine is best. This may seem magical, but such a strategy gives the best long-term benefit by avoiding the bad machines, while actually taking longer to find the “best one”, since it more or less does not care which machine is literally best as long as it keeps playing good enough machines.</p>
<h3 id="another-view-of-ab-testing">Another view of A/B testing</h3>
<p>The story of the lottery machines ends here (it will come back later), but the story of bandits has just started. Bandits are useful far beyond lotteries. In fact, bandit problems come up all over the place in statistics, computer science, the biomedical sciences, economics, etc. In statistics, A/B testing is a popular topic, and there are hundreds of articles and YouTube videos explaining how it works. It is also one of the essential tools used in large tech companies like Amazon, Meta, Google, etc. In bandit language, however, A/B testing is just a special case of the multi-armed bandit, and we present that view of A/B testing here.</p>
<p>A/B testing is an experimental method for testing new products, e.g. the layout of a webpage or the effect of a new drug. In A/B testing, randomly assigned groups of people receive one of two versions of the same design. One is called the “control”, which is usually the old design, and the other is called the “treatment”, which is usually the new design. Suppose we are testing a new design of a webpage where the “BUY NOW” button becomes red when it is clicked, and we are interested in whether this will increase the chance that people click the button.</p>
<p><img src="/images/AB.png" alt="" /></p>
<p>How should we assign people to groups, and how large a crowd do we need, before we are confident in saying that the click rate has increased? We start by assuming that every individual is identical and independent. On one hand, under this assumption, the A/B testing approach performs a hypothesis test, looks at the p-value, and works out how large each group needs to be to make the p-value less than 0.05. On the other hand, we can view this scenario as a two-armed bandit, where each person assigned to “control” corresponds to playing the “control” arm, and each person assigned to “treatment” corresponds to playing the “treatment” arm.</p>
<p>In this case, the size of each group corresponds to the number of times we play each arm, and we are essentially finding the “best arm”, which corresponds to the better of the two designs. This is exactly the “best-arm identification” setting, and with tools from the bandit literature we can find the smallest size each group needs in order to control the error probability at the 0.05 level.</p>
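As a rough back-of-the-envelope version of this calculation (a simple Hoeffding-style bound, not the sharper bandit tools alluded to above; the function name is made up), the per-group sample size needed to correctly order two arms whose true click rates differ by a gap $\Delta$ is about $(2/\Delta^2)\log(2/\delta)$:

```python
import math

def group_size(gap, delta=0.05):
    """Hoeffding-style sample size per group so that, with probability at
    least 1 - delta, the arm with the higher true rate also has the
    higher empirical rate when the true rates differ by `gap`."""
    # P(wrong ordering) <= 2 * exp(-n * gap**2 / 2); solve for n
    return math.ceil(2 * math.log(2 / delta) / gap ** 2)
```

For example, detecting a 5-percentage-point difference in click rate at the 0.05 level requires on the order of a few thousand people per group, and the requirement grows quadratically as the gap shrinks.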
<p>The benefit of this view is generalization. By extending two-armed bandits to multi-armed bandits, we can use similar tools to generalize A/B testing to multivariate testing. Also, in the biomedical sciences, people conduct clinical trials when developing a medicine. This is again a two-armed bandit, where each slot corresponds to a patient and an action corresponds to giving either the treatment or the control to that patient. With tools from bandits, we can find the optimal allocation of treatments to patients.</p>