\( \newcommand{\argmax}{\operatorname{arg\,max}\limits} \) \( \newcommand{\P}[1]{\mathbf{P} \left(#1\right)} \) \( \newcommand{\E}{\mathbf{E}} \) \( \newcommand{\R}{\mathbb{R}} \) \( \newcommand{\set}[1]{\left\{#1\right\}} \) \( \newcommand{\floor}[1]{\left \lfloor {#1} \right\rfloor} \) \( \newcommand{\ceil}[1]{\left \lceil {#1} \right\rceil} \) \( \newcommand{\logp}{\log_{+}\!} \) \( \let\epsilon\varepsilon\)

Finite-Armed Bandit Notation

The finite-armed bandit problem is a sequential game played over $n$ rounds (the horizon) between a learner and an environment. The value of $n$ is sometimes known to the learner, but sometimes not.

In each round the learner chooses an action $I_t$ from one of $K$ choices. That is, $I_t \in \set{1,2,\ldots,K}$. The learner then observes a reward $X_t$, which is sampled from some distribution depending on $\nu_{I_t}$ where $\nu_1,\nu_2,\ldots,\nu_K$ is some set of distributions with means $\mu_1,\mu_2,\ldots,\mu_K$.

The goal is to maximise the cumulative reward over all $n$ rounds, which makes the optimal action the one corresponding to the distribution with the largest mean. That is $i^* = \argmax_i \mu_i$ and $\mu^* = \max_i \mu_i$. Of course we assume that $\nu_1,\nu_2,\ldots,\nu_K$ are not known to the learner, so it is not possible to choose only the optimal action from the beginning.

The pseudo-regret of strategy $\pi$ when playing against bandit $\nu$ over $n$ rounds is denoted by
R^\pi_\nu(n) = n \mu^* – \E\left[\sum_{t=1}^n \mu_{I_t}\right]

Although $\nu$ is not known to the learner, it is typical to make some assumptions. For example, you might assume that $\nu_i$ is Gaussian for all $i$ with unit variance, but with unknown mean. That is $\nu_i = \mathcal N(\mu_i, 1)$ for some $\mu_i$. Alternatively you might assume $\nu_i$ is Bernoulli or subgaussian or even only that it has bounded variance. If the $\nu_i$ are assumed to be parameterised by their means, then I will often write the regret as $R^\pi_\mu(n)$.

Other standard notation is $\hat \mu_{i,s}$ to be the empirical mean of the first $s$ times action $i$ was chosen and
$T_i(t)$ to be the number of times action $i$ was chosen after round $t$ and $\hat \mu_i(t)$ to be its empirical estimate after round $t$, which is just $\hat \mu_{i, T_i(t)}$.

The suboptimality gaps for each arm also play an important role and are denoted by $\Delta_i = \mu^* – \mu_i$.