Balancing sums of random vectors

We study a higher-dimensional 'balls-into-bins' problem. An infinite sequence of i.i.d. random vectors is revealed to us one vector at a time, and we are required to partition these vectors into a fixed number of bins in such a way as to keep the sums of the vectors in the different bins close together; how close can we keep these sums almost surely? This question, our primary focus in this paper, is closely related to the classical problem of partitioning a sequence of vectors into balanced subsequences, in addition to having applications to some problems in computer science.


Introduction
In this note, we consider the following partitioning problem. Let $V(\mu) = (V_n)_{n \ge 1}$ be a sequence of independent random vectors, all distributed according to some common probability distribution $\mu$ on the $d$-dimensional Euclidean unit ball $B^d \subset \mathbb{R}^d$. The elements of this i.i.d. sequence $V(\mu)$ are revealed to us in order, one vector at a time. Each time a new vector is revealed to us, we are required to assign this vector to one of a fixed number of bins $B_1, B_2, \dots, B_k$ before seeing the next vector in the sequence. By adopting a suitable strategy to assign vectors to bins, how 'close together' can we keep the sums of the vectors in the different bins?
A more precise formulation of this question is as follows. For each $1 \le i \le k$, let $B_i^0 = 0$, and for a positive integer $n \in \mathbb{N}$, let $B_i^n$ denote the sum of the vectors in the bin $B_i$ at time $n$; in other words,
$$B_i^n = \sum_{\substack{1 \le j \le n \\ V_j \text{ assigned to } B_i}} V_j.$$
A partitioning strategy is a (possibly randomised) map from $(\mathbb{R}^d)^{k+1}$ to the set of bins $\{B_1, B_2, \dots, B_k\}$ which, given the vectors $B_1^n, B_2^n, \dots, B_k^n$ and $V_{n+1}$, tells us which bin $V_{n+1}$ should be assigned to; in the language of computer science, a partitioning strategy is an online algorithm for assigning vectors to bins. We shall mainly be interested in partitioning strategies that minimise, for (large) $T \ge 0$, the quantity
$$D(T) = \max_{0 \le n \le T} \ \max_{1 \le i < j \le k} \|B_i^n - B_j^n\|,$$
the largest observed Euclidean distance between a pair of bins up to time $T$.

In this paper, we shall mostly be concerned with the asymptotic behaviour of $D(T)$ as $T \to \infty$ while the dimension $d \ge 1$, the number of bins $k \ge 2$ and the distribution $\mu$ remain fixed. It is therefore worth emphasising that we work with the Euclidean norm, and track the largest observed distance between any pair of bins, purely for concreteness; indeed, all of our results hold as stated, albeit with different implied constants, for any choice of norm, and for any well-defined notion that tracks how close together the various bins are, such as the largest distance between any bin and the average of the bins.

The fact that serves as the starting point for the work here is the classical result (see [2], for instance) that by assigning vectors to bins uniformly at random, one can always ensure that $D(T) = O(\sqrt{T \log T})$; we shall attempt to quantify by how much one can hope to improve on this. We discuss two other motivations for studying the problem at hand below.
First, the related problem of partitioning a deterministic sequence of vectors from the d-dimensional unit ball into a fixed number of 'balanced subsequences' has a rich history and may be traced back to an old question due to Riemann and Lévy that was subsequently answered by Steinitz [14]; various forms of this problem have since been investigated and we refer the interested reader to the survey of Bárány [6]. We mention one result in this area that is relevant to the problem at hand. Let V = (V n ) n≥1 be any sequence of vectors lying in the d-dimensional unit ball B d ⊂ R d , and consider the problem of assigning each element of this sequence of vectors to one of a fixed number of bins B 1 , B 2 , . . . , B k as before, except using a prescient partitioning strategy: by a prescient partitioning strategy, we mean a strategy that is allowed to see the entire sequence V ahead of time. Improving on an earlier result of Doerr and Srivastav [11], Bárány and Doerr [8] proved that there exists a prescient partitioning strategy that ensures that D(T ) ≤ Cd uniformly in T for any d, k ∈ N with k ≥ 2 and any sequence V as above, where C ≈ 4.001 is a universal constant (though it should be noted that while these prescient strategies do require 'knowledge of the future', they only need to look Θ(d) vectors ahead). In the light of this fact, it is natural to ask what changes when we are required to partition V without any knowledge of the future.
Next, the question we study here arises naturally in the context of load-balancing, resource allocation and scheduling problems in large scale computation. In these settings, there is a finite set of servers and a sequence of incoming jobs. Jobs must be allocated to servers as and when they arrive, and each job consumes certain quantities of the different resources (memory or processing power, for example) available on the server it is assigned to. Ideally, one would like to allocate the jobs in a 'balanced' fashion where the total load on each server is roughly the same; see the survey of Azar [3] for a short introduction to this area. Partitioning strategies that perform well in the 'worst-case' can often be suboptimal in practice since the empirical distribution of the incoming jobs is typically random (and not adversarial). Therefore, various probabilistic models for these problems have been studied over the last twenty years and a number of results have been proved in different settings. These results, for the most part, deal with the one-dimensional case of the problem we study and address various questions about distributing (possibly weighted) balls into bins; for a small sample of the existing literature, see [4,1,15,13]. The higher-dimensional problem we study here is not only inherently interesting, but also exhibits genuinely different behaviour compared to the one-dimensional problem, as we shall shortly see. Finally, let us make two remarks with practical applications in mind: first, in practice, we lose no generality by assuming that the 'job-vectors' come from the unit ball as our results remain valid after any suitable (finite) rescaling; second, fine tuning our partitioning strategies for specific distributions can make the strategies fragile, so we focus on results that either hold for all probability distributions on the unit ball, or are robust for a wide class of 'nice' distributions.

Our results
A few remarks about notation are in order before we state our results. In what follows, we write $[n]$ for the set $\{1, 2, \dots, n\}$. We use $\langle \cdot, \cdot \rangle$ to denote the standard inner product in $\mathbb{R}^d$ and $\|\cdot\|$ to denote the associated Euclidean norm. Also, we write $\lambda_d$ for the $d$-dimensional Lebesgue measure. We shall make use of standard asymptotic notation; in what follows, the variable tending to infinity will always be $T$ unless we explicitly specify otherwise. Constants suppressed by the asymptotic notation are allowed to depend on the fixed parameters ($d$, $k$ and $\mu$) but not on $T$. Finally, we use the term with high probability to mean with probability tending to 1 as $T \to \infty$.
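Our first result, a strategy-agnostic lower bound on $D(T)$, will serve as a useful benchmark.

Proposition 2.1. Fix $d, k \in \mathbb{N}$ with $d, k \ge 2$, and let $\mu$ be the uniform distribution on $B^d \subset \mathbb{R}^d$. For any partitioning strategy that partitions $V(\mu)$ into $k$ bins, almost surely,
$$D(T) = \Omega\left(\left(\frac{\log T}{\log\log T}\right)^{1/2}\right).$$

The above proposition immediately highlights the difference between the one-dimensional partitioning problem and the same problem in higher dimensions. Indeed, if we have a sequence of i.i.d. vectors distributed according to some common distribution $\mu$ on $B^1 = [-1, 1]$, then the trivial partitioning strategy that assigns a vector $V$ to the bin with the largest sum if $V < 0$ and to the bin with the smallest sum if $V > 0$ shows that in one dimension, we may uniformly ensure that $D(T) \le 1$ for any number of bins $k$ and any distribution $\mu$.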
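The one-dimensional strategy described above is easy to check numerically. The following is a minimal sketch in Python; the function name `one_dim_greedy`, the choice of the uniform distribution on $[-1, 1]$, the seed and the parameters are assumptions made purely for illustration.

```python
import random


def one_dim_greedy(values, k):
    """Apply the trivial one-dimensional strategy to numbers in [-1, 1].

    Negative values go to the bin with the largest sum, non-negative values
    to the bin with the smallest sum. Returns the largest spread
    max(sums) - min(sums) observed at any time.
    """
    sums = [0.0] * k
    worst = 0.0
    for v in values:
        if v < 0:
            i = max(range(k), key=sums.__getitem__)  # fullest bin
        else:
            i = min(range(k), key=sums.__getitem__)  # emptiest bin
        sums[i] += v
        worst = max(worst, max(sums) - min(sums))
    return worst


# Illustrative run: uniform distribution on [-1, 1] (an assumption).
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
worst_spread = one_dim_greedy(values, k=5)
print(worst_spread)
```

In one dimension the largest pairwise distance is exactly the spread between the largest and smallest sums, which is why the sketch tracks only that quantity.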
In an attempt to match the lower bound in Proposition 2.1, we consider two different partitioning strategies below. Note that for any d, k ∈ N and any distribution µ on B d , we may, as discussed earlier, toss each element of V(µ) into one of the k bins uniformly at random and thereby ensure with high probability that D(T ) = O( √ T log T ). To improve on this trivial bound, we shall have to work a bit harder.
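As a point of reference for the trivial bound, the following minimal sketch simulates uniformly random assignment and records the largest pairwise distance seen so far. The helper names (`sample_ball`, `run_uniform_random`), the use of the uniform distribution on the ball, and the parameter values are all hypothetical choices made for illustration only.

```python
import math
import random


def sample_ball(d, rng):
    """Uniform sample from the d-dimensional unit ball by rejection sampling."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        if sum(x * x for x in v) <= 1.0:
            return v


def run_uniform_random(T, d, k, rng):
    """Toss each vector into a uniformly random bin; return the observed D(T)."""
    bins = [[0.0] * d for _ in range(k)]
    worst = 0.0
    for _ in range(T):
        v = sample_ball(d, rng)
        i = rng.randrange(k)
        bins[i] = [b + x for b, x in zip(bins[i], v)]
        # Largest Euclidean distance between any pair of bin sums right now.
        current = max(math.dist(a, b) for a in bins for b in bins)
        worst = max(worst, current)
    return worst


T = 500
baseline = run_uniform_random(T=T, d=3, k=4, rng=random.Random(0))
print(baseline)
```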
The first partitioning strategy we propose, which we call the inner product rule, is as follows: simply assign $V_{n+1}$ to the bin $B_i$ for which $\langle V_{n+1}, B_i^n \rangle$ is minimal, breaking ties arbitrarily. Intuitively, this should keep the bins close together since we always add a vector to the bin it is 'most opposite' to. We shall show that the inner product rule is a near-optimal partitioning strategy for any reasonably well-behaved probability distribution. Recall that a measure $\mu$ on $\mathbb{R}^d$ is Hölder continuous (with respect to the Lebesgue measure) if there exist constants $K, \alpha > 0$ such that $\mu(S) \le K \lambda_d(S)^{\alpha}$ for any measurable set $S \subset \mathbb{R}^d$; the following bounds for the inner product rule are essentially tight.
Theorem 2.2. Fix $d, k \in \mathbb{N}$ with $k \ge 2$, let $\mu$ be a probability distribution on $B^d \subset \mathbb{R}^d$, and suppose that we partition $V(\mu)$ into $k$ bins using the inner product rule. Then almost surely,
$$D(T) = O\left((\log T)^{1/2}\right); \qquad (1)$$
if $\mu$ is additionally Hölder continuous, then almost surely,
$$D(T) = O\left(\left(\frac{\log T}{\log\log T}\right)^{1/2}\right). \qquad (2)$$

In the light of Proposition 2.1, it is immediately clear that (2) is essentially best-possible. The discrepancy between the bound (2) for well-behaved distributions and the bound (1) for arbitrary distributions in Theorem 2.2 is not just an artefact of our proof: somewhat surprisingly, the next proposition demonstrates the existence of (slightly pathological) distributions which show that (1) is also nearly tight.

Proposition 2.3. For every increasing function $\omega : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ that grows without bound, there exists a probability distribution $\mu_\omega$ on $B^2 \subset \mathbb{R}^2$ for which the following holds. If we partition $V(\mu_\omega)$ into two bins using the inner product rule, then almost surely,

The other strategy we consider is motivated by more practical considerations: in applications, where the number of bins $k$ is often large, it is usually too expensive to compute $k$ inner products to make each decision. With this in mind, we investigate the following strategy, a higher-dimensional analogue of the 'two random choices' strategy studied by Azar, Broder, Karlin and Upfal [5], which we call the best-of-two rule. Unlike the inner product rule, the best-of-two rule is a randomised strategy: given $V_{n+1}$, we choose two bins $B_i$ and $B_j$ randomly from the set of all bins (without replacement) and assign $V_{n+1}$ to $B_i$ if $\langle V_{n+1}, B_i^n \rangle \le \langle V_{n+1}, B_j^n \rangle$ and to $B_j$ otherwise, breaking ties arbitrarily. This strategy achieves a reduction in computational complexity, but this reduction comes at a price: the following estimate for the best-of-two rule is essentially tight.
Theorem 2.4. Fix $d, k \in \mathbb{N}$ with $k \ge 2$, and let $\mu$ be any probability distribution on $B^d \subset \mathbb{R}^d$. If we partition $V(\mu)$ into $k$ bins using the best-of-two rule, then almost surely,
$$D(T) = O(\log T).$$

Note that the best-of-two rule is identical to the inner product rule when $k = 2$. That Theorem 2.4 is essentially best-possible when $k \ge 3$ is evidenced by the simple observation that if $\mu$ is the uniform distribution on $B^1 = [-1, 1]$, then with high probability, there exists an interval of length $\Omega(\log T)$ in the first $T$ steps where we repeatedly choose the same pair of bins and only see numbers exceeding $1/2$. To prove Theorem 2.4, we shall show that the best-of-two rule enforces 'self-correction'. Similar methods based on self-correction have recently been used to answer some long-standing questions about random graph processes; see [9,12,10], for example.

This paper is organised as follows. We give the proof of Proposition 2.1 in Section 3. Section 4 is devoted to analysing the inner product rule. We then address the best-of-two rule in Section 5. We finally conclude this note with a discussion of some open problems in Section 6. For the sake of clarity of presentation, we systematically omit floor and ceiling signs whenever they are not crucial.
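The two strategies described in this section can be compared empirically. The following is a minimal sketch, not an implementation taken from this paper: the function names, the uniform distribution on the ball, the seed and the parameters $T$, $d$, $k$ are assumptions chosen for illustration, and ties are broken by taking the first minimiser.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_ball(d, rng):
    """Uniform sample from the d-dimensional unit ball by rejection sampling."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        if dot(v, v) <= 1.0:
            return v


def inner_product_rule(bins, v, rng):
    """Pick the bin whose current sum has the smallest inner product with v."""
    return min(range(len(bins)), key=lambda i: dot(v, bins[i]))


def best_of_two_rule(bins, v, rng):
    """Pick two distinct bins at random and keep the one 'more opposite' to v."""
    i, j = rng.sample(range(len(bins)), 2)
    return i if dot(v, bins[i]) <= dot(v, bins[j]) else j


def simulate(rule, T, d, k, seed):
    """Run a strategy for T steps and return the observed value of D(T)."""
    rng = random.Random(seed)
    bins = [[0.0] * d for _ in range(k)]
    worst = 0.0
    for _ in range(T):
        v = sample_ball(d, rng)
        i = rule(bins, v, rng)
        bins[i] = [b + x for b, x in zip(bins[i], v)]
        worst = max(worst, max(math.dist(a, b) for a in bins for b in bins))
    return worst


T, d, k = 2000, 3, 4
d_ip = simulate(inner_product_rule, T, d, k, seed=1)
d_b2 = simulate(best_of_two_rule, T, d, k, seed=1)
print(d_ip, d_b2)
```

The best-of-two rule evaluates two inner products per step rather than $k$, which is the computational saving discussed above; the sketch makes no claim about which rule gives the smaller value on any particular run.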

Lower bounds
This section is devoted to the proof of Proposition 2.1, our strategy-agnostic lower bound. For completeness, we first record the following fact about the size of 'slices' of the d-dimensional unit ball.
Proposition 3.1. For any d ∈ N, there exist constants C, c > 0 such that for any e ∈ S d−1 and any We now prove Proposition 2.1.
Proof of Proposition 2.1. First, suppose that k = 2 and consider any partitioning strategy that partitions V(µ) into two bins. Let δ n = B n 1 − B n 2 and write e n for the unit vector in the direction of δ n . Note that we have δ n+1 where V = V n+1 if the strategy assigns V n+1 to B 1 and V = −V n+1 otherwise. Consider the event E n+1 = {1/2 ≤ δ n+1 2 − δ n 2 ≤ 5/4}. We claim that regardless of the partitioning strategy used, we have where c d > 0 is a constant depending on the dimension d alone. Indeed, regardless of which bin we assign V n+1 to, if V n+1 , e n ∈ [−(8 δ n ) −1 , (8 δ n ) −1 ] and V n+1 ≥ 3/4, then E n+1 holds. Since µ is the uniform distribution and V(µ) is an i.i.d. sequence, the claimed bound (3) follows easily from Proposition 3.1. Now, break the set [T ] into r = T /m disjoint blocks T 1 , T 2 , . . . , T r each of length m for some m = m(T ) that grows slowly with T (and will be specified later). We say that a block T is good if Using (3) and the fact that V(µ) is an i.i.d. sequence, we see that It follows that Using the Markov property, we now deduce that Applying the above bound with m = log T / log log T , we conclude that for all sufficiently large T ; the proposition, in the case where k = 2, now follows from the Borel-Cantelli lemma.
In contrast to the situation with the arguments for upper bounds (that follow in subsequent sections), we may easily obtain a lower bound in the case where the number of bins exceeds two from the argument above that deals with the case of exactly two bins. Indeed, when $k > 2$, we proceed by 'merging' the bins $B_1, B_2, \dots, B_{k'}$ and the bins $B_{k'+1}, B_{k'+2}, \dots, B_k$ into two auxiliary bins $A_1$ and $A_2$ respectively, where $k' = \lfloor k/2 \rfloor$; in other words, we set
$$A_1^n = \sum_{i=1}^{k'} B_i^n \quad \text{and} \quad A_2^n = \sum_{i=k'+1}^{k} B_i^n.$$
If $k$ is even, then we finish the proof as follows. By the argument above, it is clear that regardless of the partitioning strategy used, there exists an $n \in [T]$ for which
$$\|A_1^n - A_2^n\| \ge \left(\frac{\log T}{2 \log\log T}\right)^{1/2}$$
with probability at least $1 - T^{-2}$; the result now follows from the triangle inequality. If $k$ is odd on the other hand, then the result follows from an analogous argument where we track

The inner product rule
We shall analyse the inner product rule in this section. We need the following standard Chernoff-type bound; see [2] for a proof.
. . , X n are independent random variables taking values in {0, 1}, then writing We start by proving Theorem 2.2.
Proof of Theorem 2.2. Given m ≥ 0, we wish to bound P(D(T ) ≥ m) from above. Somewhat surprisingly, this is harder to do in the case where k ≥ 3 as opposed to when k = 2. Indeed, to control D(T ), we need to control the distance between each pair of bins; however, if we attempt to control these distances individually, we quickly run into difficulties because we cannot say much about how the distance between a particular pair of bins changes at each time (unless k = 2). The trick is to instead track the observable . First, writing e n (i, j) for the unit vector in the direction of δ n (i, j), note that In particular, under the inner product rule, we have if V n+1 is assigned to either B i or B j , and δ n+1 (i, j) = δ n (i, j) otherwise. Hence, if V n+1 is assigned to some bin B h , then Next, writing = km 2 /2, note that indeed, as a consequence of the triangle inequality, we have for any h, i, j ∈ [k]; summing this estimate over all h ∈ [k], we deduce that P(E n ( )).
Next, set $r = \ell/2k$. Note that for $n \le r$, we have $S_n \le n(k-1) < \ell$, so it follows that $P(E_n(\ell)) = 0$. For $n \ge r + 1$, we define

Under the inner product rule, we know (see (4)) that $S_{n+1} - S_n \le k - 1$ for all $n \in \mathbb{N}$, so it is clear that if $S_n \ge \ell$, then $S_{n-t} \ge \ell/2$ for each $t \in \{0, 1, \dots, r\}$. Therefore, it is clear that $E_n(\ell) \subset F_n(\ell)$ for each $n \ge r + 1$, so

We shall estimate $P(F_n(\ell))$ by studying how our observable can change in a single step using (4), which in turn will allow us to bound $P(D(T) \ge m)$ using (5). We need slightly different arguments depending on whether or not the underlying distribution $\mu$ is well-behaved. The key difference between the two cases is that the probability that the change in our observable in a single step is 'bad' decays with $\ell$ in the case where $\mu$ is well-behaved (see Claim 4.4), but is merely bounded away from 1 in general (see Claim 4.3).
Case 1: Arbitrary distributions. We first establish (1) for an arbitrary probability distribution µ on B d . We proceed by induction over the dimension. The result is trivial in the case where d = 1 as the inner product rule coincides with the trivial one-dimensional strategy described in Section 2. Now, suppose that d > 1 and that we have established the required bound in dimension d − 1.
The starting point of our argument is the following observation.  Proof. We say that a point x ∈ B d is µ-heavy if µ(U) > 0 for every open neighbourhood U of x. If the set of µ-heavy points is contained in some hyperplane H passing through the origin, then it follows by compactness that for any ε > 0, µ(H ε ) = 1, where H ε is the set of points at distance less than ε from H; as µ is a probability measure, it follows that µ(H) = 1. Therefore, we may suppose that there exist µ-heavy points x 1 , x 2 , . . . , x d such that no hyperplane passing through the origin contains all of these points; in other words, we may assume that for every e ∈ S d−1 , there exists an i ∈ [d] such that | x i , e | > 0. This implies the conclusion of the lemma, once again by compactness.
We now apply Lemma 4.2 to $\mu$: if $\mu(H \cap B^d) = 1$ for some hyperplane $H$ passing through the origin, then we are done by induction; we may therefore assume that there exist $A_1, A_2, \dots, A_d$ and $c, p > 0$ as promised by Lemma 4.2.
To bound $P(F_n(\ell))$ from above, we first estimate, for each $t \ge 0$, the probability of the event

Proof. Relabelling the bins if necessary, suppose that the largest distance between a pair of bins at time $t$ is the distance between the bins $B_1$ and $B_2$. If $S_t \ge \ell/2$, then it must be the case that $\|\delta_t(1, 2)\|^2 \ge S_t/k^2 \ge \ell/2k^2$. We know from Lemma 4.2 that with probability at least $p$, we have

Consider the bins $B_\pm$ for which the inner products $\langle V_{t+1}, B_\pm^t \rangle$ are maximal and minimal respectively. By definition, $V_{t+1}$ gets assigned to $B_-$ under the inner product rule. Now, since

it follows from (4) that if $S_t \ge \ell/2$, then with probability at least $p$, we have

where the last inequality holds provided $\ell$ is sufficiently large; this proves the claim.
Consider any interval of $r$ steps in which our observable lies in the range $[\ell/2, \ell]$ and note that since our observable increases by at most $k - 1$ at each step, there are at most $rk^2/(c\sqrt{\ell})$ steps in this interval where our observable decreases by at least $c\sqrt{\ell}/k$. Consequently, if $F_n(\ell)$ holds, then there are at most $rk^2/(c\sqrt{\ell}) \le rp/2$ different values of $t \in \{n - r, n - r + 1, \dots, n - 1\}$ for which the event $I_t(\ell)$ holds, provided $\ell$ is sufficiently large. Using the Markov property, we deduce from Claim 4.3 and Proposition 4.1 that for all $n \in \mathbb{N}$, we have $P(F_n(\ell)) \le \exp(-rp/8)$.
We know that $P(D(T) \ge m) \le \sum_{n=1}^{T} P(F_n(\ell))$, so it is now clear that for all $m \ge 0$, we have
$$P(D(T) \ge m) \le T \exp(-rp/8), \qquad (6)$$
where $\ell = km^2/2$, $r = \ell/2k$ and $p > 0$ is a constant depending on $d$ and $\mu$ alone. It follows from (6) that $D(T) = O((\log T)^{1/2})$ with probability at least $1 - T^{-2}$; the required bound (1) now follows from the Borel-Cantelli lemma.

Case 2: Hölder continuous distributions. We now show how we may improve on (1) for well-behaved distributions. It turns out that if $\mu$ is Hölder continuous, then it is possible to say a lot more about how our observable changes in a single step than in the general case.
The starting point in this case is to bound, for each t ≥ 0, the probability of the event Claim 4.4. If µ is Hölder continuous, then for each t ≥ 0, where C, c > 0 are constants depending on d, k and µ alone.
Proof. If V t+1 is assigned to some bin B h , then, by (4), we have From the triangle inequality, Hence, if it so happens that Therefore, it follows that As µ is a Hölder continuous probability distribution and V(µ) is an i.i.d. sequence, the claim now follows from Proposition 3.1.
As before, since $S_{n+1} - S_n \le k - 1$ for all $n \in \mathbb{N}$, it is clear that in any interval of $r$ steps in which our observable lies in the range $[\ell/2, \ell]$, there must exist at least $r/2$ steps in this interval where our observable decreases by at most $k - 1$. Consequently, if $F_n(\ell)$ holds, then there must exist at least $r/2$ different values of $t \in \{n - r, n - r + 1, \dots, n - 1\}$ for which the event $J_t(\ell)$ holds. Using the Markov property, we deduce from Claim 4.4 that if $\mu$ is Hölder continuous, then we have
$$P(F_n(\ell)) \le \binom{r}{r/2} \left(\frac{C}{\ell^{c}}\right)^{r/2}$$
for all $n \in \mathbb{N}$. It follows that for all $m \ge 0$, we have
$$P(D(T) \ge m) \le T \binom{r}{r/2} \left(\frac{C}{\ell^{c}}\right)^{r/2}, \qquad (7)$$
where $\ell = km^2/2$, $r = \ell/2k$ and $C, c > 0$ are constants depending on $d$, $k$ and $\mu$ alone. A simple calculation using (7) shows that $D(T) = O((\log T / \log\log T)^{1/2})$ with probability at least $1 - T^{-2}$; the required bound (2) in the case where $\mu$ is Hölder continuous now follows from the Borel-Cantelli lemma.
Note that under the inner product rule, the vector V n+1 is only ever assigned to a bin B i if B n i lies on the convex hull of the set {B n 1 , B n 2 , . . . , B n k }. Much is known about the convex hulls of random subsets of R d (see [7], for example), and it seems possible to us that the Hölder condition in Theorem 2.2 could be relaxed by carefully tracking the convex hull of the bins. However, we cannot altogether do away with some sort of 'well-behavedness' condition: the inner product rule does not match the lower bound in Proposition 2.1 in general, as evidenced by Proposition 2.3 which we prove below.
Proof of Proposition 2.3. Given $\omega : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ that is both increasing and unbounded, we first fix a fast-growing sequence of 'length-scales'. More precisely, we fix a sequence $L = (L_s)_{s \ge 1}$ of positive reals such that for all $s \in \mathbb{N}$, we have $L_s \ge 2$ and $\omega\left(\exp(s^2 L_s^2) - 1\right) \ge 10s$.
Writing T s = exp(s 2 L 2 s ) , this construction ensures that we have for each s ∈ N. Having constructed L, we define an atomic probability distribution µ L on B 2 with weight (6/π 2 )s −2 on the vector (1/L s , −1/2) for each s ∈ N.
We now define µ ω as follows. A vector drawn from µ ω is the vector (0, 1/2) with probability 1/3, uniformly distributed on B 2 with probability 1/3, and distributed according to µ L with probability 1/3. We shall show, using an argument analogous to the one used to prove Proposition 2.1, that if we partition V(µ ω ) into two bins B 1 and B 2 using the inner product rule, then for all sufficiently large s ∈ N. It is then clear that almost surely, We now prove that (8) holds for all sufficiently large s ∈ N. To this end, fix s ∈ N and write L = L s and T = T s = exp(s 2 L 2 ) . Also, let δ n = B n 1 − B n 2 for all n ∈ N. As before, we break the set [T ] into r = T /L 2 disjoint blocks T 1 , T 2 , . . . , T r each of length L 2 ; in other words, for i ∈ [r], we have T i = {t + 1,t + 2, . . . ,t + L 2 }, where t = (i − 1)L 2 . We say that a block T is good if δ n ≥ L/10 for some n ∈ T . For i ∈ [r], writing t = (i − 1)L 2 , we denote by F i the event that δ t < L. We deduce (8) from the following claim.
Proof. We bound the probability of a block being good by showing that in the span of a block, there is a reasonably good chance of walking, using an alternating sequence of the vectors (0, 1/2) and (1/L, −1/2), a distance of about L/10 to the right starting from somewhere close to the origin. We make this precise below.
Writing t = (i − 1)L 2 , first consider the event E 1 that there exists a time P ∈ {t + 1,t + 2, . . . ,t + 10L} at which we have δ P ≤ 10. We claim that indeed, this follows from the fact that for all n ∈ N, we crudely have because V n+1 is sampled from the uniform distribution on B 2 with probability 1/3.
Next, let S = {(x, y) : 0 ≤ x ≤ 1 and − 1/4 ≤ y ≤ 0} ⊂ R 2 and consider the event E 2 that there exists a time Q ∈ {P, P + 1, . . . , P + 10 5 } at which we have δ Q ∈ S. It is not hard to check that again, we crudely have To see this, note that if P exists, then it is possible to walk, while 'respecting the inner product rule' throughout, from δ P to the set S using vectors of norm 1/2 in at most 100 steps; the claimed bound then follows by 'enlarging' such a walk and using the uniform component of µ ω . Finally, consider the event E 3 that the vectors V Q+1 ,V Q+2 , . . . ,V Q+L 2 /5 are alternately the vectors (0, 1/2) and (1/L, −1/2). It is easy to see from the definition of µ ω that Since δ Q ∈ S under E 2 ∩ E 1 ∩ F i , if E 3 also holds, then a simple calculation shows that we have δ t+1 = δ t +V t+1 for each t ∈ {Q, Q + 1, . . . , Q + L 2 /5 − 1} under the inner product rule; consequently, under E 3 ∩ E 2 ∩ E 1 ∩ F i , we have δ Q+L 2 /5 = δ Q + (L/10, 0). It is now clear, provided s ∈ N is sufficiently large, that we have with room to spare.
Using the Markov property, we now deduce from Claim 4.5 that where the last inequality holds provided s is sufficiently large as T = exp(s 2 L 2 ) . It is now clear that (8) holds for all sufficiently large s ∈ N; the proposition follows.

The best-of-two rule
In this section, we prove Theorem 2.4. Before turning to the proof, let us recall the following classical concentration inequality due to Azuma and Hoeffding.
Proposition 5.1. Let (X t ) t≥0 be a supermartingale such that |X t − X t−1 | ≤ C for all t ≥ 1. For all positive integers N and all m ≥ 0, we have Armed with the Azuma-Hoeffding inequality, we are ready to prove Theorem 2.4.
Proof of Theorem 2.4. To prove the result, we shall first show that the distance between any pair of bins is 'self-correcting' under the best-of-two rule, and then use martingale techniques to track these distances. We proceed by induction over the dimension. Consider the function f µ : We claim that it suffices to prove the result in the case where C µ > 0. Indeed, if C µ = 0, then since f is continuous and S Let δ n = B n 1 − B n 2 and let A n = δ n 2 . Our first task will be to estimate the conditional expectation we do this as follows.
Recall that given V n+1 , we choose two bins B i and B j uniformly at random from the set of all bins (without replacement) and assign V n+1 to B i , say, if V n+1 , B n i ≤ V n+1 , B n j . Let E + denote the event that the two bins chosen at time n + 1 are precisely B 1 and B 2 and let E − denote the event that neither of these two bins is chosen at time n + 1. Also, for i ∈ {3, . . . , k}, let E i denote the event that two bins chosen at time n + 1 are B i and one of B 1 or B 2 .
First, writing e n for the unit vector in the direction of δ n , we have Next, as the bins B 1 and B 2 are untouched at time n + 1 under E − , we also have E[A n+1 − A n | A n , E − ] = 0. Finally, we observe the following.
Proof. To simplify notation, let $V = V_{n+1}$, $U_1 = B_1^n$, $U_2 = B_2^n$ and $U_i = B_i^n$. To prove the claim, we decompose $E_i$ into the events

First, as the vector $V$ is deterministically assigned to the bin $B_i$ under $E_i(-, -)$,

Indeed, under $E_i(-, +)$ for example, it is clear that the best-of-two rule always assigns $V$ to either $B_i$ or $B_2$ (but never to $B_1$); since we also have $\langle V, U_1 - U_2 \rangle > 0$ under $E_i(-, +)$, the claim follows. Finally, under $E_i(+, +)$, the best-of-two rule never assigns $V$ to $B_i$, and $V$ is equally likely to be assigned to either $B_1$ or $B_2$ because each of these bins is equally likely to be the other bin selected in addition to $B_i$. Therefore,
$$E[A_{n+1} - A_n \mid \delta_n, E_i(+, +)] = E\left[\|V\|^2 \mid \delta_n, E_i(+, +)\right] + \tfrac{1}{2} E\left[2\sqrt{A_n}\,\langle V, e_n \rangle \mid \delta_n, E_i(+, +)\right] + \tfrac{1}{2} E\left[2\sqrt{A_n}\,\langle -V, e_n \rangle \mid \delta_n, E_i(+, +)\right],$$
and consequently,

Putting these facts together, it follows that

proving the claim.
It is now clear that E[A n+1 − A n | A n , (E + ) c ] ≤ 1. As P(E + ) ≥ 1/k 2 , we deduce from (9) that where C = 2C µ /k 2 ≤ 1 is a positive constant depending on d, k and µ alone.
With the benefit of hindsight, let m = (100 log T /C) 2 and denote by F the event that A n > 2m for some n ∈ [T ]. To bound P(F) from above, we define a collection of stopping times as follows. Let L 0 = 0 and for each j ∈ N, let 1. U j = inf{n : n ≥ L j−1 and A n ≥ m}, and 2. L j = inf{n : n ≥ U j and A n < m}.
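The stopping times $U_j$ and $L_j$ defined above mark the successive excursions of $(A_n)$ above the level $m$. As a small illustration, the following sketch extracts these times from a finite path; the function name `excursion_times`, the example path and the convention of returning `None` for an infimum over the empty set are assumptions made for illustration.

```python
def excursion_times(A, m):
    """Return the pairs (U_j, L_j) for a finite path A[0], A[1], ...

    U_j is the first n >= L_{j-1} with A[n] >= m, and L_j is the first
    n >= U_j with A[n] < m, starting from L_0 = 0. An infimum over the
    empty set (no such n within the path) is reported as None.
    """
    times = []
    previous_L = 0
    while True:
        U = next((n for n in range(previous_L, len(A)) if A[n] >= m), None)
        if U is None:
            break
        L = next((n for n in range(U, len(A)) if A[n] < m), None)
        times.append((U, L))
        if L is None:
            break
        previous_L = L
    return times


# Hypothetical path with two excursions above the level m = 3.
path = [0, 1, 3, 5, 2, 1, 4, 6, 7, 3, 0]
print(excursion_times(path, m=3))  # [(2, 4), (6, 10)]
```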
If $F$ holds, then it is clear that there exists a $j \in \mathbb{N}$ such that $A_n > 2m$ for some $n \in [U_j, L_j \wedge T]$. Let $F_j$ denote the event that there exists an $n \in [U_j, L_j]$ such that $A_n > 2m$ and note, by the union bound, that $P(F) \le \sum_{j=1}^{T} P(F_j)$. Therefore, to complete the proof, it suffices to show that $P(F_j) = o(T^{-3})$ for each $j \le T$. For concreteness, we show that $P(F_1) = o(T^{-3})$; the same argument may be used to bound $P(F_j)$ for any $j \le T$. In what follows, all inequalities will hold provided $T$ (and hence $m$) is sufficiently large. Writing $U = U_1$ and $L = L_1$, we define another stopping time $N = L \wedge \inf\{n : n \ge U \text{ and } A_n > 2m\}$.
Clearly, P(F 1 ) = P(N < L). Now, set = C √ m/2 and consider, for t ≥ 0, the process First, note that for each t ∈ [0, N − U), where the inequality above is immediate from the definition of N. Next, we also have for each t ∈ [0, N − U), where the last inequality follows from (10) and the definitions of U and L. It is now clear that (X t ) t≥0 with t ∈ [0, N − U] is a supermartingale with increments bounded by 4 √ m. Therefore, for any N ∈ [0, L − U], by the Azuma-Hoeffding inequality, we have where the last inequality holds uniformly in N. By applying the union bound over the (at most T ) possible values of N, we obtain that P(F 1 ) = o(T −3 ). This completes the proof of Theorem 2.4.

Conclusion
First, it would be nice to know under what conditions (2) holds in general. We have proved this estimate for probability distributions satisfying a Hölder condition. At the other end of the spectrum, the same bound also holds for probability distributions supported on a finite number of atoms; in fact, it can be shown in this case that under the inner product rule, we deterministically have D(T ) = O(1). We know from Proposition 2.3 that the inner product rule does not match the lower bound in Proposition 2.1 in general, however. Next, it is worth mentioning that the construction in Proposition 2.3 was designed specifically to be 'bad' for the inner product rule; in particular, this construction does not improve on the strategy-agnostic lower bound in Proposition 2.1. It is therefore an intriguing problem to decide the following: given a probability distribution on the unit ball, does there exist a (distribution-specific) partitioning strategy that matches the lower bound in Proposition 2.1 to within a constant factor? Of course, one can also ask the following (perhaps more difficult) question: is there a universal strategy that matches the lower bound in Proposition 2.1 for every probability distribution on the unit ball?
Finally, it would also be good to improve the implicit constants in our results and quantitatively understand the influence of the number of bins on the problems at hand; indeed, it is natural to expect that the freedom to use more bins should offer better control. A careful analysis of our proofs shows that for the uniform distribution, the lower bound in Proposition 2.1 and the upper bound in Theorem 2.2 differ by a multiplicative factor of k, roughly; bridging this gap remains an interesting problem.