Biased Halfspaces, Noise Sensitivity, and Local Chernoff Inequalities

A halfspace is a function f : {−1,1}n→{0,1} of the form f (x) = 1(a · x > t), where ∑i ai = 1. We show that if f is a halfspace with E[ f ] = ε and a′ = maxi |ai|, then the degree-1 Fourier weight of f is W 1( f ) = Θ(ε2 log(1/ε)), and the maximal influence of f is Imax( f ) = Θ(ε min(1,a′ √ log(1/ε))). These results, which determine the exact asymptotic order of W 1( f ) and Imax( f ), provide sharp generalizations of theorems proved by Matulef, O’Donnell, Rubinfeld, and Servedio, and settle a conjecture posed by Kalai, Keller and Mossel. In addition, we present a refinement of the definition of noise sensitivity which takes into consideration the bias of the function, and show that (like in the unbiased case) halfspaces are noise resistant, and, in the other direction, any noise resistant function is well correlated with a halfspace. Our main tools are ‘local’ forms of the classical Chernoff inequality, like the following one proved by Devroye and Lugosi (2008): Let {xi} be independent random variables uniformly distributed in {−1,1}, and let ai ∈ R≥0 be such that ∑i ai = 1. If for some t ≥ 0 we have Pr [∑i aixi > t] = ε , then Pr[∑i aixi > t +δ ]≤ 2 holds for δ ≤ c/ √ log(1/ε), where c is a universal constant. ∗Department of Mathematics, Bar Ilan University, Ramat Gan, Israel. nathan.keller27@gmail.com. Research supported by the Israel Science Foundation (grants no. 402/13 and 1612/17) and the Binational US-Israel Science Foundation (grant no. 2014290). †Department of Mathematics, Bar Ilan University, Ramat Gan, Israel. ohadkel@gmail.com. c © 2019 Nathan Keller and Ohad Klein cb Licensed under a Creative Commons Attribution License (CC-BY) DOI: 10.19086/da.10234 ar X iv :1 71 0. 07 42 9v 3 [ m at h. C O ] 2 5 Se p 20 19 NATHAN KELLER AND OHAD KLEIN


The maximal influence of halfspaces
The influence of the kth coordinate on a Boolean function f is defined as where x ⊕ e k is obtained from x by flipping the kth coordinate. The total influence of f is I( f ) ∑ k I k ( f ).
At first sight, it may seem that the kth influence of a halfspace 1(∑ i a i x i > t) is 'proportional' to the weight a k . However, this is not the case; for example, the halfspace f = 1( 4 5 x 1 + 3 5 x 2 > 0) is equal to the dictator function f (x) = 1(x 1 > 0), and the influence of the second coordinate on it is zero. Hence, it is desirable to find a relation between the influences and the weights, to the extent that such a relation exists.
In [33,Theorem 36], Matulef et al. proved a lower bound on the maximal influence of halfspaces: and used it as another central component in their algorithm for testing halfspaces. The authors of [33] conjectured that the lower bound can be improved to Ω(max i {a i }E[ f ]). This conjecture was later proved by Dzindzalieta and Götze [13]. We determine the exact asymptotic order of the largest influence of a halfspace: Theorem 1.4. There exist universal constants c 1 , c 2 such that for any halfspace f = 1(∑ i a i x i > t) with E[ f ] ≤ 1 2 and a 1 ≥ a 2 ≥ . . . ≥ a n ≥ 0, we have In view of the aforementioned example, even the fact that there at all exists a fixed relation between the maximal influence of a halfspace and its largest weight is perhaps somewhat surprising.

The vertex boundary of halfspaces
A halfspace f naturally corresponds to the set 1 f = {x : f (x) = 1}, which may be viewed as a subset of the discrete cube graph. A natural isoperimetric question one may ask is: what is the relation between the size of this set (which is, of course, 2 n · E[ f ]), and the size of its boundary? In finite graphs, there are two classical types of boundary of a set S: the edge boundary, which consists of the edges that connect a vertex in S with a vertex in the complement of S, and the vertex boundary, which consists of the vertices in S that have a neighbor outside S (or, vice versa, of the vertices outside S that have a neighbor in S).
It is easy to see that the size of the edge boundary of the set 1 f is equal (up to normalization) to the total influence I( f ) ∑ k I k ( f ), and thus is usually easier to deal with. We show that for halfspaces, the asymptotic size of the vertex boundary ∂ (1 f ) admits a nice expression in terms of E[ f ] and the maximal weight |a 1 |. Theorem 1.5. There exist universal constants c 1 , c 2 such that for any halfspace f = 1(∑ i a i x i > t) with 2 and a 1 ≥ a 2 ≥ . . . ≥ a n ≥ 0, we have The theorem is proved by showing that for halfspaces, the vertex boundary is approximately equal to the largest influence, and then applying Theorem 1. 4. We note that other relations between the measure of the vertex boundary and influences were obtained by Talagrand [50].

Noise sensitivity of biased functions and correlation with halfspaces
A Boolean function is called noise sensitive if flipping each of its input bits with a small probability affects its output 'significantly'. Otherwise, it is called noise resistant. Formally, the noise stability of a function f at noise rate 1 − ρ is defined as where y is obtained from x by independently keeping each coordinate of x unchanged with probability ρ, and replacing it by a random value with probability 1 − ρ. A sequence of functions { f m : {−1, 1} n m → {0, 1}} is called asymptotically noise sensitive if for any constant ρ ∈ (0, 1), we have lim m→∞ S ρ ( f m ) = 0. For the sake of simplicity, we consider a single function f and say that it is noise sensitive if S ρ ( f ) = o n (1), and is noise resistant otherwise. Noise sensitivity is a fundamental property of Boolean functions that has been studied extensively over the last two decades. Its applications span several areas, including machine learning (e.g., [11,32]), hardness of approximation (e.g., [31,37]), percolation theory (e.g., [19,44]), and social choice theory (e.g., [26,37]).
A main result of the seminal work of Benjamini, Kalai and Schramm [2] that initiated the study of noise sensitivity, is that noise resistance is closely related to strong correlation with a halfspace, and to a property of the Fourier expansion. Specifically, they showed the following result. (b) Any unbiased halfspace is noise resistant (and actually, satisfies a stronger property called 'noise stability').
(c) For any noise resistant monotone Boolean function f , there exists an unbiased halfspace g such that Cov( f , g) = Ω(1).
We note that in the non-monotone case the situation is more complex. Indeed, as was shown recently by Mossel and Neeman [36], even the stronger assumption that f is noise stable is not sufficient for guaranteeing the existence of a halfspace g such that Cov( f , g) = Ω(1).
The definition of noise sensitivity is 'not interesting' for highly biased functions (i.e., when E[ f ] is close to 0 or to 1), as any such function is clearly noise sensitive. Hence, it is natural to ask what should be the 'right' definition of noise sensitivity for highly biased functions. Inspired by Theorem 1.6, we propose a 'Fourier-theoretic' definition.
Note that Theorem 1.6(a) asserts that an unbiased monotone function is noise resistant if and only if its first-degree Fourier weight is, up to a constant factor, the maximum possible. For general functions, the aforementioned 'level-1 inequality' asserts that W 1 ). Based on this, we say that f is noise resistant if W 1 ( f ) is within a constant factor of the maximum possible. Formally: for some universal constant c. Theorem 1.1 allows us to claim that with respect to this definition, the close relation between noise resistance and strong correlation with a halfspace holds also for biased functions. Indeed, one direction (i.e., that any halfspace is Fourier noise resistant) is exactly the assertion of Theorem 1.1. In the converse direction, Mossel and Neeman [36,Proposition 3.2] showed that for any Boolean function f , there exists a halfspace g such that Cov( f , g) ≥ Ω( . We show the following sharp bound, which is always stronger than the bound of [36] by the level-1 inequality.
Theorem 1.8. For any Boolean function f , there exists a halfspace g such that where c is an absolute constant. In particular, if f is Fourier noise resistant and E[ f ] ≤ 1/2 then there exists a halfspace g such that Cov( f , g) = Ω(E[ f ]).
Note that the correlation asserted in the theorem is clearly within a constant factor of the maximum possible, as Cov( f , g) ≤ E[ f ] for any f , g. An interesting feature of Theorem 1.8 is that unlike the classical result of [2], the strong correlation with a halfspace is guaranteed even if the function f is not monotone. This is somewhat surprising, as most known correlation bounds (such as FKG-type inequalities [17]) hold only for monotone functions.
Finally, we show that for monotone functions, strong correlation with a halfspace is implied also by a 'probabilistic' notion of noise resistance. Here, the rate of noise we consider is 1 − c/ log(1/E[ f ]), for a fixed 'small' constant c (i.e., ρ = c/ log(1/E[ f ])). It is easy to show (see Section 9) that for this noise rate, any function f satisfies Recalling that the classical definition of noise resistance is S ρ ( f ) = Ω(1), which is within a constant factor of the maximal possible value, a natural definition of noise resistance in our setting is the requirement Proposition 1.9. There exists a universal constant c > 0 such that for any monotone function f : , and consequently, there exists a halfspace g such that Cov( f , g) = Ω(E[ f ]).

Local Chernoff Inequalities
Tail estimates for weighted sums of independent random variables are among the most frequently used probabilistic tools in combinatorics and theoretical computer science. A standard example is Hoeffding's inequality which asserts that if {x i } n i=1 are independent mean-zero random variables with ∀i : |x i | ≤ 1 and {a i } n i=1 are real numbers that satisfy ∑ i a 2 i ≤ 1, then for any t > 0, In the commonly-studied case where each x i is uniformly distributed in {−1, 1} (also called Rademacher random variables), stronger bounds can be obtained, which essentially state that ∑ a i x i is distributed 'like' a Gaussian random variable. In particular, there exists a constant c such that for any t > 0, where Z ∼ N(0, 1). (This is a result of Eaton [14]; the 'correct' value of c was recently determined by Bentkus and Dzindzalieta [5] to be ≈ 3.178.) This phenomenon is also demonstrated by the Central Limit Theorem, or its more quantitative form, the Berry-Esseen Theorem (see, e.g., [16]), which implies that for any interval I, where c is an absolute constant. (The claim holds, e.g., for c = 1; the best currently known bound on c was obtained by Shevtsova [47]). The 'local Chernoff inequalities' we consider in this paper assert that the rate of decay of Pr[∑ a i x i > t] as a function of t is also essentially equal to that of a Gaussian random variable Z ∼ N(0, 1).

A local Chernoff inequality of Devroye and Lugosi, via a general method of Benjamini, Kalai, and Schramm
In a remark ending their seminal paper on the variance of first passage percolation [3], Benjamini et al. suggested a general method for deriving 'local' tail estimates for random variables from hypercontractive inequalities. Essentially, in order to obtain a local tail estimate for f , one considers the function Then one uses a theorem of Talagrand [48] (Theorem 4.1 below, whose proof relies on hypercontractivity) to show that Var(g t ) is 'small' (as a function of ε), and deduces an upper bound on the minimal δ such that Pr( f > t + δ ) ≤ ε/2 using Chebyshev's inequality. In [10], Devroye and Lugosi developed the method of [3] and used it to obtain various local tail bounds. In particular, applying the method to the function f = ∑ a i x i , they proved the following tail estimate: Theorem 1.10 (Devroye and Lugosi, 2008). Let {x i } be independent random variables uniformly distributed in {−1, 1}, and let a i ∈ R ≥0 be such that ∑ i a 2 i = 1. There exists a universal constant c > 0 such that if t ≥ 0 and ε = Pr This shows that the 'relative' decay of the tail probability Pr [∑ i a i x i > t] is essentially equal to that of a Gaussian random variable Z. Indeed, an easy computation yields that if for some t ≥ 0 we have Pr[Z > t] = ε, then the minimal δ such that Pr[Z > t + δ ] ≤ ε/2 is of order Θ 1/ log(1/ε) . Theorem 1.10 implies that if for some t > 0, the probability Pr[∑ i a i x i > t] is much smaller than the Gaussian-like bound provided by (4), then for any t > t, the probability Pr[∑ i a i x i > t ] will 'remain' much smaller than that of a Gaussian random variable. The theorem is tight up to a constant factor, e.g., for X = ∑ n i=1 1 √ n x i where n is sufficiently large; this follows immediately from (5), using the exact rate of decay of the Gaussian distribution.
Following the notation of [10] where such estimates were called 'local tail bounds', we refer to Theorem 1.10 and its variants as local Chernoff inequalities. 1

Refined variants, via log-concavity
For our applications, we will need a refined inequality, which takes into consideration the weights a i : , and let δ be minimal such that If B, S is any partition of {1, 2, . . . , n} (which corresponds to 'big' and 'small' values of the a i 's), then one of the following holds: We also prove the following inequality, which applies in the slightly more general case of bounded symmetric random variables: Theorem 1.12. Let X = ∑ x i , where {x i } are independent symmetric (around 0) random variables with |x i | ≤ a i almost surely, and let F(t) = Pr [X > t]. Set m = 2 max i {a i }, and let c ∈ (0, 1). If ε = F(t) for some t ≥ 0, and δ ≥ 0 is minimal such that F(t + δ ) ≤ c · F(t), then we have The main tool in the proof of Theorem 1.12, which we also use to present an alternative proof of Theorem 1.10, is a 'relaxed log-concavity' lemma: Lemma 1.13. Let X = ∑ i x i be a sum of independent real random variables, and denote 1 We note that possibly, the name 'Hoeffding' should be used here instead of 'Chernoff'. However, as it is quite common to call all results of this type 'Chernoff-type inequalities', we prefer to use this name. We prove the lemma by constructing an explicit measure-preserving injection from the event where X 1 , X 2 are two identical, independent, copies of X. The idea is to swap an appropriate fraction of X 1 and X 2 , in a way that increases X 2 , at the expense of decreasing X 1 . Theorem 1.11 follows from Lemma 1.13 and Hoeffding's inequality via some technical computations.
We believe that these 'local' tail estimates and their variants, as well as the log-concavity lemma, will be useful in other contexts as well.

Organization of the paper
This paper is organized as follows. In Section 2 we present notation and conventions to be used throughout the paper. In Section 3 we prove Lemma 1.13 and another concentration lemma, and in Section 4 we use these lemmas to prove the local Chernoff inequalities (namely, Theorems 1.10, 1.11 and 1.12). Our results on the first-degree Fourier weight (Theorem 1.1), the maximal influence (Theorem 1.4), the vertex boundary size (Theorem 1.5), and the kth-degree Fourier weight (Theorem 1.3) of halfspaces are presented in Sections 5, 6, 7, and 8, respectively. Finally, we study noise sensitivity of biased functions and prove Theorem 1.8 and Proposition 1.9 in Section 9.

Conventions
In this section we present notation and conventions that we will use throughout the paper.

For a Boolean function
We say that f is almost unbiased if c ≤ E[ f ] ≤ 1 − c for a universal constant c (whose exact value does not matter). On the other hand, f is said to be strongly biased if (1). Notice these notions formally make sense only for families of Boolean functions. However, as we use them only for enhancing intuition, we skip over the accurate formulation when the meaning is clear from the context.

2.
A halfspace is a Boolean function of the form f = 1{a · x > t} with a ∈ R n and t ∈ R. We are always going to assume that a i ∈ R ≥0 and that a 1 ≥ . . . ≥ a n > 0. Furthermore, we frequently assume µ( f ) ≤ 1 2 ; this mostly does not affect generality since we can alternatively investigate the dual function g(x) = 1 − f (−x) which shares many properties with f (note that g is also a halfspace).

3.
We sometimes identify the halfspace f with 2 f − 1 = sgn(a · x − t), where sgn is the sign function, and we choose sgn(0) = −1. Furthermore, since there are only finitely many (2 n ) values for a · x, we may increase t a little without changing f . Moreover, notice that as long as we are interested in a particular halfspace f = 1{a · x > t}, we may assume that a · x does not assume any finite set of values since slightly altering a does not change f .

4.
For an a ∈ R n ≥0 as above and an s ∈ R we write f s (x) = 1{a · x > s}. (Notice that the notation f s (x) is used only for halfspaces, where s always denotes the threshold). Additionally, we use the notation F(t) = Pr x [a · x > t], and so, F(t) = µ( f t ). Furthermore, we regularly write ε µ( f t ) F(t).

5.
The letters β , γ, δ are usually used to describe a significant decay of F; e.g., in several places β is chosen to be the minimal positive real value satisfying F(t + β ) ≤ 1 3 F(t).

Two Concentration Lemmas
In this section we prove two concentration results concerning sums of independent random variables. These lemmas are central tools in the proof of the local Chernoff inequalities in Section 4, and are also used in other proofs in the sequel. The first is Lemma 1.13, which asserts that if are independent real random variables, and we denote Notice the (−m) in Equation (6) can not in general be omitted. This is because b, c, d might not be 'aligned' to values achievable by X. To compare, in the context of the usual notion of discrete log-concave distribution, the underlying random variable assumes only integer values. This does not capture the behavior of the variables X discussed in Lemma 1.13.
The second result is a concentration lemma which assumes (in addition) that the random variables are symmetric.
Lemma 3.1. Let x i be independent symmetric (around 0) random variables with |x i | ≤ a i , almost surely. For any m ≥ max i {a i } and any s,t such that 0 ≤ s ≤ t, we have Note that while it may seem that the lemma 'should' hold with the constant in the right hand side of (7) equal to 1, we show below by an explicit example that this constant must be at least 2.
We prove both lemmas constructively using injective measure preserving maps from the set of events that represent the l.h.s. into the set of events that represent the r.h.s. We introduce several injective transformations which we will use to construct the maps in Section 3.1 and present the proof of the lemmas in Section 3.2.

Auxiliary injective transformations
Definition 3.2 (Prefix/suffix flip). Let r be a real number, and let u, v ∈ R n be two vectors whose 'partial sums of differences' That is, we choose the first index t in which the partial sum S t (u, v) exceeds r, and interchange the coordinates of the vector (u, v) in all indices later than t. (This is called a 'suffix flip'). For r ∈ R and a single vector u ∈ R n such that max k∈[n] {S k (u, −u)} ≥ r, we define PF r (u) to be the unique v ∈ R n that satisfies (−v, v) = SF r (u, −u).
That is, we choose the first index t in which the partial sum ∑ i≤t u i exceeds r/2 and flip all coordinates of u with indices no later than t. (This is called a 'prefix flip').
Notice that the suffix flip is defined (in particular) for any (u, v) such that ∑ u i ≥ ∑ v i + r, and the prefix flip is defined (in particular) for any u such that ∑ u i ≥ r/2. Also, notice that SF r (u, v) is an involution (meaning that SF r (SF r (u, v)) = (u, v)), and hence, is injective. Moreover, as PF r is a composition of SF r (restricted to inputs of the form (u, −u)) with the map x → (−x), it is injective as well.
Remark. We note that prefix/suffix flips are similar to the classical André's reflection method ( [1]; see also [43]), extensively used in enumerative combinatorics and other fields. . Let x ∈ {−1, 1} n be a vector whose 'partial sums' That is, we flip only a single coordinate of x -the first among the indices k for which the partial sum S k (x) is maximal.
The map SCF is invertible, and thus, injective. To see this, note that if in the map x → SCF(x), the tth coordinate was flipped, then the latest index in which the maximum of {S k (SCF(x))} is attained is t − 1. Indeed, by the definition of SCF, we have S t−1 (SCF(x)) = S t (x) − 1, while for any t < t we have S t (SCF(x)) ≤ S t (x) − 1 and for any t ≥ t we have S t (SCF(x)) ≤ S t (x) − 2.
In particular, we can define an inverse mapping ISCF as follows. Take t = max{i ∈ {0, . . . , n − 1} | ∀ j : S i (x) ≥ S j (x)} (where S 0 (x) 0) and set We would like to use the map SCF not only for x's such that max k S k (SCF(x)) > 0, but also for x's for which we only know that ∑ a i x i > 0 for some non-negative weights a 1 , . . . , a n . For this, we define the following variant which reorders the coordinates of x according to the sizes of the a i 's, applies SCF, and then reorders the coordinates back.
Note that for any a ∈ R n ≥0 , we have ISCF a • SCF a = id dom(SCF a ) , and hence, SCF a is injective. We claim that the function SCF a (x) is defined (in particular) for all x's such that ∑ a i x i > 0. To see this, note that after the re-ordering of the coordinates of x, the function SCF is applied on a vector x that satisfies ∑ a i x i > 0, where a 1 ≥ . . . ≥ a n (of course, the a i 's are a reordering of the a i 's). The only reason SCF(x ) might not be defined, is if ∀k : S k (x ) = ∑ k i=1 x i ≤ 0. But this cannot happen, from the following Abel's-summation argument: where the last inequality follows from the, apparently wrong, assumption ∀k : S k (x ) ≤ 0.

Proof of Lemmas 1.13 and 3.1
Now we are ready to present the proofs of the lemmas. Lemma 1.13. Let X = ∑ i x i be a sum of independent real random variables, and denote Proof. After subtracting F(c)F(d) from both sides of Inequality (6), it is left to prove that Let Ω be the underlying probability space over which {x i } i and X are defined. Without loss of generality, ω → (x i (ω)) n i=1 is an injective map. We define an injective measure-preserving map which takes as input a pair (ω 1 , ω 2 ) ∈ Ω 2 for which X (ω 1 ) > d and X (ω 2 ) ∈ (b, c] and outputs (δ 1 , δ 2 ) ∈ Ω 2 for which X (δ 1 ) > c and X (δ 2 ) ∈ (b + d − c − m, d]. This will clearly conclude the proof. Set r = (d − c) − m (note that if r ≤ 0 then the assertion holds trivially), and consider by swapping some pairs of elements, δ 1 , δ 2 are well-defined as well. Second, ψ is measure-preserving, as the variables x i are independent, and SF r just swaps pairs of identically distributed variables in its input. Furthermore, ψ is injective, since SF r is invertible from the left (as noted after Definition 3.2).
Hence, it remains to show that the range of ψ is included in the space represented by the r.h.s. of (8), i.e., that X(δ 1 ) > c and X( Observe that (unless r < 0, in which case the assertion of the lemma is trivial), we have according to how SF r is defined -swapping the suffix just after the first index t 0 for which (Here we also use the fact that each difference is bounded by m, and thus we have as required. We will do this with an injective measure-preserving map. Let r = t − s − 2m (note that the assertion holds trivially if r ≤ 0). Denote u i = x i , and define v = PF r (u). By the definition of PF r , for any u we have is injective by Definition 3.2, and is measure-preserving as it only negates some x i 's, which we assumed are symmetric random variables. Therefore, (9) holds, and thus, it is sufficient to show that For this, we construct four injective, measure-preserving maps from sub-events of The maps we are going to construct will only negate some of the x i 's, so B, S are reconstructible from the output of any of the maps. Thus, it is sufficient to show that the maps are injective given the partition B, S. Moreover, these maps will be measure-preserving as they only negate input variables.

The first map ψ 1 is defined on inputs
. We set r = m, apply PF r (x i ) i∈S on the S-coordinates, and leave the B-coordinates unchanged. The map is injective as PF r is. Also, the output y = ψ 1 (x) satisfies ∑ i∈[n] y i ∈ (s − m, s + m] by (10) and the definition of S.

The second map ψ 2 is defined on inputs
. We set r = 2m, apply PF r (x i ) i∈S on the S-coordinates, and leave the B-coordinates unchanged. The map is injective and the output y = ψ 2 (x) satisfies ∑ i∈[n] y i ∈ (s − m, s + m] exactly like in the previous case.

The third map ψ 3 is defined on inputs
Notice that for such inputs, we have ∑ i∈B x i > 0. First, for every i ∈ B, we extract s i = sgn(x i ) and b i = |x i |. Then, we treat σ {s i } i∈B as a {−1, 1}-valued vector and {b i } i∈B as a vector of weights. Since we can apply the map SCF b to the vector σ (as noted right after Definition 3.4), to obtain σ = SCF b (σ ). Finally, we define the output y = ψ 3 (x) by ∀i ∈ B : y i = σ i b i and ∀i ∈ S : y i = x i . That is, we apply a single coordinate flip to the vector (x i ) i∈B , and leave the S-coordinates unchanged. The map is injective since SCF b is. By the definitions of SCF and of B, we have (as SCF flips a single coordinate whose value is between m/2 and m, being taken from B). Therefore, the output might not satisfy ∑ i∈[n] y i ∈ (s − m, s + m]. If it does not, we apply the next map ψ 4 on our intermediate "output" y.

The output of the previous map
. In this case we set ψ 4 (x) ψ 3 (y). Note that y satisfies the conditions under which ψ 3 is defined. Indeed, we have ∑ i∈S y i < s + m since ψ 3 does not alter S-coordinates, and ∑ i∈[n] y i > s + m, as by (12) we have Indeed, applying again (12) we see that on the one hand, and on the other hand, This completes the proof of the lemma. Two remarks are due.
1. One may wonder whether the constant 5 in Inequality (7) can generally be improved. We believe the correct value is 2; it surely cannot be less. Indeed, consider X = ∑ i∈[n] x i where x i ∼ {−1, 1} independently from each other. Trivially, X assumes only values equal to n (mod 2). Hence, taking n to be a large odd integer and (m = 1.5, s = 1,t = 2), we have 2. One may also wonder whether Inequality (7) can be strengthened in the case where m is large, so that the constant 5 is replaced by 1 + O(max a i /m). This can indeed be done, by slightly modifying the proof of Lemma 3.1. Specifically, in the beginning of the proof we may define r = t − s − 2 max i {a i }, and then due to (10), instead of (9) we get Hence, it is sufficient to prove that and this indeed follows immediately by invoking Lemma 3.1 itself as a black-box.
The following corollary of Lemma 3.1 will be used several times in the sequel, so for the sake of convenience we state it explicitly. Corollary 3.5. Let f s = 1{a · x > s} be a family of halfspaces and suppose a 1 ≥ a 2 ≥ . . . a n ≥ 0. Then for any s,t ∈ R such that |s| ≤ t, we have 5I 1 ( f s ) ≥ I 1 ( f t ).
Hence, for 0 ≤ s < t, the corollary follows immediately from Lemma 3.1. To prove the assertion for s < 0 < t, note that if s = −s, we have Hence, it is sufficient to prove a variant of Lemma 3.1 in which the assertion is replaced by

Local Chernoff Inequalities
In this section we prove our local Chernoff inequalities, namely, Theorems 1.10, 1.11, and 1.12. First, for the sake of completeness we present the proof of Theorem 1.10 using the general method of Benjamini et al. [3], due to Devroye and Lugosi [10], and then we present a proof of all three theorems, via Lemma 1.13.

Proof of Theorem 1.10, using the Benjamini-Kalai-Schramm method
Let us recall the formulation of Theorem 1.10.
To prove the theorem, one needs the following result of Talagrand [48], whose proof relies on the hypercontractive inequality [6]. Then Proof of Theorem 1.10, due to Devroye and Lugosi [10]. We let g = max(t, a · x) and ε = Pr[g > t], and apply Theorem 4.1 to g. Now, we bound the terms that appear in (14). Firstly, we have Secondly, by the Cauchy-Schwarz inequality, and thus, Substituting into (14) and using the assumption ∑ i a 2 i = 1, we obtain On the other hand, by Chebyshev's inequality, we have Since for t ≥ 0 we have Pr[g = t] ≥ 1/2 ≥ ε/2, we must have E[g] ≤ t + 2 Var(g)/ε, and so we deduce Substituting the bound of (15) on Var(g), we obtain Pr[g > t + δ ] ≤ ε/2 for δ = O(1/ log(1/ε)), as asserted.
4.2 Proof of Theorems 1.10, 1.11, and 1.12, using Lemma 1.13 In the proof, we shall use Lemma 1.13 via the following auxiliary lemma.
Applying repeatedly Lemma 1.13, with d = t + δ + m, c = t, and b taken from the sequence of values b = r, r + δ , r + 2δ , . . . , r + (l − 1)δ , we get a series of inequalities: This completes the proof.
Now we are ready to present the proofs of the theorems. We note that although we already presented a proof of Theorem 1.10 above, we present an alternative proof as well, since it is more constructive and may be applicable in settings where Talagrand's result does not apply.
We begin with a proof of Theorem 1.12.
Proof. Without loss of generality, assume ∑ a 2 i = 1. Denote ε = F(t). We split into two cases.
We now prove Theorems 1.10 and 1.11 together.
Theorems 1.10 and 1.11. Let {x i } be independent random variables uniformly distributed in {−1, 1}, and let a i ∈ R ≥0 be such that ∑ i a 2 i = 1. There exists a universal constant c > 0 such that if t ≥ 0 and Furthermore, if B, S is any partition of [n] (which corresponds to 'big' and 'small' values of the a i 's), We have |B| β 2 ≤ ∑ i∈[n] a 2 i ≤ 1, and so |B| ≤ 1 4 log(1/ε). By the law of total probability, Notice that 2 −|B| ≥ ε 1/4 , and so each of the probabilities on the right hand side is ≤ ε 3/4 . For y ∈ {−1, 1} B , let δ y be minimal such that as asserted.
Now, given a partition B, S of [n], we show that either |B| ≥ 1 2 log(1/ε), or δ ≤ ∑ i∈S a 2 i log(1/ε) . Assume that |B| < 1 2 log(1/ε), and note that by (19), we have ∀y : Pr Let δ y be as in the above proof, and consider the random variable Z = α ∑ i∈S a i z i , for α = (∑ i∈S a 2 i ) −1/2 (which is needed for rescaling the weights to have sum-of-squares equal 1). Applying (20) to Z , with t = α(t − ∑ i∈B a i y i ) in place of t and using Pr[Z < t ] ≤ ε 1/2 which follows from (21), we get and we conclude with δ ≤ O ∑ i∈S a 2 i log(1/ε) since δ ≤ max y δ y , as above.

First-Degree Fourier Weight of Halfspaces
In this section we prove Theorem 1.1 stating that any halfspace ) (which is the maximal possible value up to a constant factor, by the aforementioned level-1 inequality).
We start with an easy lemma describing how 'large coordinates' (i.e., coordinates i for which a i is 'large') influence a halfspace.
(Note that G does not depend on the value of the coordinate x i .) We have By the definition of G, where the last inequality holds since a i > β /2.
We proceed with a lemma which states that in some sense, the influence of a coordinate on a halfspace is 'proportional' to its weight. (Recall that there exist halfspaces for which the weight of some coordinate is positive and nevertheless, it has zero influence; the lemma shows that this 'anomaly' can be fixed by slightly modifying the function.) Proof. By the definition of f t+s , we have Changing the order of integration, we can express e δ i in a different way: (Note that there is no difference here between closed-open segments and open segments, as there is no change in the integral involved.) We may assume δ ≥ a i , for otherwise the statement of the lemma is trivial. An easy computation confirms that for any t, r ∈ R we have So we overall obtain , and so we deduce This completes the proof.
Now we are ready to prove that any halfspace has a 'large' Fourier weight on the first degree.
The main idea of the proof is as follows. We divide the coordinates into a set B of coordinates i whose weight a i is 'large', and a set S of coordinates whose weight is 'small' (the exact definition is given below). We show that either |B| is 'large', and then by Lemma 5.1, the contribution of the coordinates in B is already sufficient to guarantee W 1 ( f t ) ≥ Ω(ε 2 log(1/ε)), or else, the contribution of the coordinates in S will guarantee W 1 ( f t ) ≥ Ω(ε 2 log(1/ε)). To show the latter (which is the more complex case), we note that by the Cauchy-Schwarz inequality, we have Hence, it is sufficient to show that We will do so, using the local Chernoff inequality presented in Section 4. (Notice that the situation ∑ i∈S a 2 i = 0 is contained in the '|B| is large' case.) Let β be minimal such that F(t + β ) ≤ ε/3, and let γ be minimal such that F(t + γ) ≤ ε/6. Denote δ = β + γ, let S = {i ∈ [n] | a i ≤ β } and B = [n] \ S.
Since a i ≤ β , the definition of β and γ implies Applying the 'strong local Chernoff inequality' (Theorem 1.11) to the function f t , with S, B as defined above, we obtain that either |B| ≥ 1 2 lg (1/ε) or γ ≤ O ∑ i∈S a 2 i / log(1/ε) . (Formally, the theorem is applied three times, where Pr[a · x > s] drops from ε to ε/2, ε/4, and then ε/6 as s increases.) In the latter case, we have δ ≤ 2γ ≤ O ∑ i∈S a 2 i / log(1/ε) . We consider three cases: Case 1: |B| is large -specifically, |B| ≥ 1 2 lg (1/ε). Note that by Lemma 5.1, every i ∈ B has I i ( f t ) ≥ 2ε/3. Hence, in this case we have W 1 ( f t ) = ∑ i I i ( f t ) 2 ≥ 2 9 ε 2 log(1/ε), as asserted. Case 2: ε ≥ 1 4 . In this case, we use the aforementioned theorem of [18] which asserts that any halfspace g r : as asserted. Since Recall that by definition, e δ i = E s∼U(0,δ ) [I i ( f t+s )], and thus, by linearity of expectation, we have Hence, (28) implies that there exists s ∈ (0, δ ) with To show that (25) holds, and thus complete the proof of the theorem, it is sufficient to show that the inequality (29) holds also for s = 0. This is achieved in the following proposition.
Proposition. In the former settings, where |B| ≤ 1 2 lg(1/ε) and ε < 1 4 , for any s > 0 we have ∑ i∈S a i I i ( f t ) ≥ ∑ i∈S a i I i ( f t+s ). Proof of the Proposition. We start by showing that ∑ i∈B a i ≤ t. Assume the contrary. We then have Hence, ε ≥ 1/4, which contradicts the assumption.
From the monotonicity of halfspaces, we have Thus, as ∀x : 0 ≤ f t+s (x) ≤ f t (x), it is sufficient to show that ∑ i∈S a i x i ≥ 0 whenever f t > 0. This indeed holds, as where the ultimate inequality holds since ∑ i∈B a i ≤ t. This completes the proof of the proposition, and thus also the proof of the theorem.

The Maximal Influence of Halfspaces
In this section we prove Theorem 1.4.
Theorem 1.4. There exist universal constants c 1 , c 2 such that for any halfspace f = 1(∑ i a i x i > t) with E[ f ] ≤ 1 2 and a 1 ≥ a 2 ≥ . . . ≥ a n ≥ 0, we have

Proof of the Lower Bound
We start with the lower bound, which follows directly from the tools developed in the previous sections.
Since e δ 1 is defined as E s∼U(0,δ ) [I 1 ( f t+s )], this implies that there exists s ≥ 0 with Finally, by Corollary 3.5, for any s > 0 we have 5I 1 ( f t ) ≥ I 1 ( f t+s ). (Note that since µ( f t ) ≤ 1 2 , we may assume t ≥ 0 and so, Corollary 3.5 can indeed be applied.) Hence, as asserted.

Proof of the Upper Bound
To prove the upper bound, we use the following 'reverse' version of Corollary 3.5 which asserts that while I 1 ( f s ) is a decreasing function of s up to a constant factor, the 'normalized' influence I 1 ( f s )/ E[ f s ] is increasing up to a constant factor. Lemma 6.2. Let a ∈ R n ≥0 satisfy a 1 ≥ . . . ≥ a n . For every t ≥ s ≥ 0 with E [ f t ] > 0, we have Proof. For any r 1 , r 2 ∈ R, let G(r 1 ) = Pr x∼{−1,1} n [a · x − a 1 x 1 > r 1 ] and G(r 1 , r 2 ] = G(r 1 )−G(r 2 ). Notice that for any r, we have Hence, in order to prove (31), it is sufficient to show that for every t ≥ s ≥ 0 with E [ f t ] > 0, Let t ≥ s ≥ 0. Lemma 1.13 (with m = 2a 1 ) asserts that s + a 1 ,t + 3a 1 ) and subtract G(c)G(d) from both sides to deduce It follows that .
Adding 1 to both sides we obtain .
Taking reciprocal and using Corollary 3.5, we get and thus, (32) holds, as asserted.

A Corollary of Theorem 1.4
We conclude this section with a corollary of Theorem 1.4 which essentially describes the probability that a linear form l(x) = ∑ a i x i (where x i ∼ {−1, 1} uniformly and independently) lies in some interval (a, b], by means of the tail probability Pr[l(x) > a] and the interval length |I| = b − a. This corollary generalizes [45,Theorem 4], up to the multiplicative constants. We note that one could also prove this result directly, by an argument similar to that of the proof of Theorem 1.4.
It remains to consider the case −m ≤ t < 0. We claim that the assertion in this case follows from the assertion for t = 0. On the one hand, let l (x) be a small enough perturbation of l(x), such that Pr[l (x) = 0] = 0 and the functions 1{l(x) > t} and 1{l(x) > t + 2m} coincide with the functions 1{l (x) > t} and 1{l (x) > t + 2m}, respectively. (Such a perturbation exists, as explained in Section 2). Then, using the symmetry of l (x) and the case t = 0 of (38 In this section we prove Theorem 1.5. Theorem 1.5. There exist universal constants c 1 , c 2 such that for any halfspace f = 1(∑ i a i x i > t) with E[ f ] ≤ 1 2 and a 1 ≥ a 2 ≥ . . . ≥ a n ≥ 0, we have In addition, we show that for halfspaces, VB 1 ( f t ) and VB 0 ( f t ) cannot be too far from each other, while for general Boolean functions they can be 'very' far.
Remark. All the results in this section apply also to halfspaces f t having µ( f t ) ≥ 1 2 . The difference is that µ( f t ) should be replaced by 1 − µ( f t ) and VB 0 ( f t ) exchanges roles with VB 1 ( f t ).

Proof of Theorem 1.5
We start with a proposition, which, together with Theorem 1.4, implies Theorem 1.5.
Since µ( f t ) ≤ 1 2 , we may assume w.l.o.g. t > 0 as noted in Section 2. We observe that, since a i ≥ a k for all k > i, if for some x we have f t (x) = f t (x ⊕ e k ) and i < k satisfies x i = x k , then f t (x) = f t (x ⊕ e i ). Hence, setting so that c k = 1 2 b k (−1, . . . , −1). Note that by the law of total probability, We claim that for any λ ∈ {−1, Hence, Corollary 3.5 (applied to the family of halfspaces {1{∑ n i=k+1 a i x i > s}}, using the assumption t > 0) implies: Overall, we have which completes the proof. Remark. The lower bound of (40) is tight, e.g., for the dictatorship 1(x 1 > 0). As is apparent from (41) We have no conjecture for what the correct upper bound for VB 1 ( f t ) is.

A Relation Between Upper Boundary and Lower Boundary of Halfspaces
We present now an argument similar to Proposition 7.2, which establishes a sharp relation between VB 0 ( f ) and VB 1 ( f ) for halfspaces.
For any halfspace f t with µ( f t ) ≤ 1 2 , we have Remark. Note the left inequality in (42) is tight for 'Hamming balls' with n sufficiently large with respect to t. The right inequality in (42) is tight for 'subcubes' Proof of Proposition 7.3. Let f t = 1{a · x > t} be a halfspace with Pr[ f t = 1] ≤ 1 2 , and assume without loss of generality a 1 ≥ a 2 ≥ . . . ≥ a n > 0 and t > 0.
be the auxiliary variables from the proof of Proposition 7.2, and set Recall that by the law of total probability we have In addition, like in the proof of Proposition 7.2, we observe that 2VB 0 ( f t ) = ∑ n k=1 b k (1, . . . , 1). We will soon prove Combining (43) with the two former observations, we obtain completing the proof. Hence, it is only left to prove (43). For this, consider the family of halfspaces Indeed, an application of Lemma 6.2 to the family of halfspaces { f λ } implies that I max ( f λ ) E[ f λ ] is (up to a constant) an increasing function of t λ t − ∑ i<k a i λ i in the range t λ ≥ 0; In the range t λ ≤ 0, is clearly a decreasing function of t λ , while from Corollary 3.5 I max ( f λ ) is (up to a constant) an increasing function of t λ . This confirms Inequality (44). Hence, .
, as required by (43). This completes the proof.

An Example Showing Discrepancy Between the Upper Boundary and the Lower Boundary, for General Boolean Functions
We conclude this section with an example, suggested by Rani Hod, showing that for general Boolean functions, the difference between VB 1 ( f ) and VB 0 ( f ) can be very large (in contrast to Proposition 7.3, which should be viewed as a property of halfspaces). The example is based on a random construction of Talagrand [49], originally proposed as an example of a monotone Boolean function g with 'maximal possible' vertex boundary VB(g) = Ω(1) and 'maximal possible' total influence I(g) = Ω( √ n).
where, for every i, S i is a random subset of [n] of size b. Also, let f (x) = h(x) ∨ Maj n (x). We claim the following (proofs will be given below).
Hence, there exists an almost unbiased Boolean function f having a multiplicative gap of √ n between VB 0 ( f ) and VB 1 ( f ). (For comparison, Proposition 7.3 implies that for almost unbiased halfspaces we have VB 0 ( f ) = Θ(VB 1 ( f )).) Furthermore, the example implies that apparently, there is no analog to Proposition 7.2 for general functions, as f and its dual function 1 − f (−x) are similar in terms of Fourier expansion (and in particular, have the same influences), but are very different with respect to the VB 1 (·) measure.
Let us verify the above claims.

2.
For an x ∈ {−1, 1} n to be in the upper-boundary of f , either it is in the upper-boundary of h or in that of Maj n . We have VB 1 (Maj n ) = Θ(1/ √ n), and VB 1 (h) ≤ µ(h) ≤ 1/b ≤ 1/ √ n as above. Hence, VB 1 3. For an x ∈ {−1, 1} n to be in the lower-boundary of f , it is sufficient that x is in the lowerboundary of h and Maj n (x) = 0. Let x ∈ {−1, 1} n be chosen uniformly at random among the vectors that satisfy ∑ x i = −2c √ n, for a fixed c ∈ (0, 10). We want to show that with some positive probability ('continuously') depending on c, x lies in the lower-boundary of h. As µ({x | −20 √ n ≤ ∑ x i < 0}) = Ω(1), this will imply VB 0 ( f ) = Ω(1).
Since µ(h) = o(1) as we showed above, it is sufficient to show that Consider a specific i ∈ [a]. The probability O(c))), and is nonzero with a constant probability. Thus, (45) holds, as asserted.

kth Degree Fourier Weight of Halfspaces
The classical level-k inequality [39, Section 9.5] asserts the following.
Theorem 8.1 (Level-k inequality). For any k ∈ N and for any halfspace f t = 1{a · x > t} with ε = µ( f t ) < e −k/2 , we have In this section we prove Theorem 1.3 which asserts that the level-k inequality is tight for strongly biased halfspaces, up to a multiplicative factor depending only on k. Specifically, we prove the following result which clearly includes Theorem 1.3.
Theorem 8.2. There exist universal constants c 1 , c 2 such that for any k ∈ N and for any halfspace Remark. Let us compare Theorem 8.2 with Theorem 1.1. While Theorem 1.1 expresses the tightness of the level-1 inequality for halfspaces (up to a multiplicative constant factor), Theorem 8.2 states that even the level-k inequalities are essentially tight for halfspaces. On the other hand, Theorem 8.2 has three disadvantages. The first is the requirement that all the influences I i ( f t ) are somewhat small. (Note that the maximal possible value of an influence is 2µ( f t ), and so, we are 'missing' a factor of O(k).) The second is that Equation (47) is not the exact converse of the level-k inequality, as there is a factor log(2k) −k off.
The third is that we require µ( f t ) to be at most 2 −999k , instead of e −k/2 in the level-k inequality, that is, a factor c k off. We note however that for a constant k (which is the case highlighted in Theorem 1.3), (47) is indeed tight up to a constant multiplicative factor. We believe the deficiencies of Theorem 8.2 are actually not inherent, and are side effects of our proof. Specifically, it is plausible one can omit the assumption that f t 's influences are small, replace the multiplicative (c 2 log(2k)) −k in Equation (47) by c k for some universal constant c > 0, and claim that the assertion holds whenever µ( f t ) ≤ 2 −k .
This section is organized as follows. In Section 8.1 we present a study a certain 'k-degree perturbation' of influences, that generalizes the perturbation we used in the proof of Theorem 1.1. In Section 8.2 we use these k-degree perturbations to prove Theorem 8.2, modulo two auxiliary claims. These claims are proved in Section 8.3.

A k-Degree Smoothing of Influences
Let f t = 1{a · x > t} be a halfspace. Recall that in the proof of Theorem 1.1, a central role is played by In this subsection we consider the following degree-k generalization of this notion. For a set S, we denote a S = ∏ i∈S a i . We let T = T k be a random variable, distributed as the sum of k independent U(0, 1)-distributed variables (also called 'the Irwin-Hall distribution'). Then, for any S ⊂ [n] with |S| = k we set and consider the quantity The following propositions will help us to study this quantity.
Claim 8.3. Let X = ∑ m i=1 X i be the sum of m independent random variables distributed X i ∼ U(0, 1). Let G m = G X be the cumulative distribution function of X. Then the mth derivative of G m satisfies G (m) . The assertion follows by induction. where Proof. Integrating each time the inner-most integral, one can find by induction that a i t i dt m · · · dt 1 , and so, Equation (49) follows.
Proposition 8.5. Let f t (x) = 1{a · x > t} be a halfspace with ∀i : a i ≥ 0, let S ⊂ [n] be a set of size k, let δ > 0 and let e δ S be as defined in (48).
Proof. Notice we have, by definition, with the somewhat abusive notation a · x = ∑ i∈S a i x i and a · y = ∑ i / ∈S a i y i . Recall that T is defined as the sum of k independent U(0, 1)-distributed variables, and let G k be the cumulative distribution function of T . We have Hence, substituting into (51) and using Fubini's theorem, we obtain In the right hand side, for each fixed y, the expectation over x has the form E x∼{−1,1} m [x [m] g(s + a · x)] discussed in Claim 8.4, with (k, S, G k , (a · y − t)/δ , a/δ ) in place of (m, [m], g, s, a), respectively. Thus, by Claim 8.4, each such expectation can be bounded from below by By Claim 8.3, the kth derivative G (k) k is piece-wise constant, and specifically, satisfies G x . Hence, we can partition the y's into subsets, such that inside each subset, G The 'main' subset we consider is {y ∈ {−1, 1} n | a · y − t ∈ [A, δ − A]}, for which we have ((a · y − t − A)/δ , (a · y − t + A)/δ ) ⊂ (0, 1), and thus, G = 1 for all r ∈ ((a · y − t − A)/δ , (a · y − t + A)/δ ). Therefore, for all y's of this subset we have E x∈{−1,1} S x S G k a·x+a·y−t δ ≥ ∏ i∈S a i δ k . The other subsets correspond to y's for which (a · y − t + A)/δ ∈ [ j, j + 1); as we are interested only in a lower bound, we may take the contributions of all these subsets with a '−' sign, and enlarge each such set of y's for the sake of simplicity. Doing so and substituting into (52), we get One can easily obtain the following two (crude) inequalities: Substituting into (53), we obtain: which implies the assertion of the proposition (notice that for l = 1, we have an extra additive term of 1; this is handled by replacing k−1 l we obtained here with k l in the assertion of the proposition, as k−1 1 + 1 = k 1 ).

Proof of Theorem 8.2
Now we are ready to present the proof of Theorem 8.2. Let us recall the statement of the theorem.
Theorem 8.2. There exist universal constants c 1 , c 2 such that for any k ∈ N and for any halfspace Proof of Theorem 8.2. Let f t (x) be a halfspace that satisfies the assumptions of the theorem and denote ε = µ( f ). Let β be minimal such that F(t + β ) ≤ ε/3 and let γ be minimal such that ∀l ∈ N : F(t + lγ) ≤ ε/(6k) l . Denote δ β + γ. Note that by Theorem 1.10, we have Let A be the sum of the k largest weights a i , and let M ∑ |S|=k a S e δ S , where a S = ∏ i∈S a i and e δ S is as in Proposition 8.5. Similarly to the proof of Theorem 1.1, we are going to prove (47) by combining upper and lower bounds on M. We shall need two technical claims whose proofs will be presented in Section 8.3. Claim 8.6. For any η > 0, there exists a constant c = c(η) such that for any k and for any halfspace f t with µ( f t ) < 2 −999k and I 1 ( f t ) ≤ cµ( f t )/k, we have: where a 1 = max i a i and β is as defined above.
Claim 8.7. Let f t be a halfspace that satisfies the assumptions of Theorem 8.2. For any x ∈ {−1, 1} n with f t (x) = 1, we have ∑ |S|=k a S x S ≥ 0.
As for all s ≥ 0 and for any x ∈ {−1, 1} n we have f t (x) ≥ f t+s (x) ≥ 0, Claim 8.7 implies that Taking expectation over x and using Fubini's theorem, we get that for any s ≥ 0, In particular, as each e δ S is a convex combination of expressions of the form f t+s (S), it follows that By the Cauchy-Schwarz inequality, this implies On the other hand, by Proposition 8.5 we have Combining Equations (54), (55) and (57), we obtain Therefore, the assertion of the theorem will follow once we prove the following bound. Notice e 1 = s 1 = 1. Furthermore, s i is a decreasing sequence satisfying Now, we prove that e m−1 ≤ 2me m for any m ≤ k. Indeed, As e 1 = 1, the assertion follows by induction.
Equation (58) together with Claim 8.8 implies the desired inequality, This completes the proof of Theorem 8.2, modulo the proofs of Claims 8.6 and 8.7 that will be presented below.

Proof of the Auxiliary Claims
In this subsection we prove Claims 8.6 and 8.7, thus accomplishing the proof of Theorem 8.2.
Claim 8.6. For any η > 0, there exists a constant c = c(η) such that for any k and for any halfspace f t with µ( f t ) < 2 −999k and I 1 ( f t ) ≤ cµ( f t )/k, we have: where a 1 = max i a i and β is as defined above.
Proof. Let f t be a halfspace that satisfies µ( f t ) < 2 −999k and I 1 ( f t ) ≤ cµ( f t ), with a sufficiently small c to be determined below. (Note that the assumption on the influences of f t is weaker than the assumption of Claim 8.6.) Denote ε = µ( f t ). By Theorem 1.4, we have cε ≥ I 1 ( f t ) ≥ c a 1 ε log(1/ε), and thus, a 1 ≤ c/c log(1/ε). As by assumption, ε ≤ 2 −999k , it follows that for a sufficiently small c = c(η), we have a 1 ≤ η/ √ k, as desired.
Now, we wish to show 2ka 1 ≤ β , and for this we use the assumption: If a 1 ε 3β ≤ e β 1 ≤ 5I 1 ( f t ), then the assumption I 1 ( f t ) ≤ c 1 ε/k implies a 1 15β ≤ c 1 k , and thus, 2ka 1 ≤ β , provided c 1 is sufficiently small. Thus, we may assume Since for any r, s we have and as I 1 and similarly for f t+β , it follows that By (60), this implies However, as by Corollary 3.5, which contradicts the assumption I 1 ( f t ) ≤ c 1 ε/k for a sufficiently small c 1 . This completes the proof.
In order to prove Claim 8.7, we need another auxiliary claim.
Claim 8.9. For any halfspace f t that satisfies the assumptions of Claim 8.6 with η = 1/16, we have t ≥ 4 √ k.
The following proof method appears in [35]. For completeness we repeat it here.
Proof of Claim 8.9. We will show Pr ∑ a i x i > 4 √ k > 2 −999k . Since by assumption, 2 −999k ≥ µ( f t ) = Pr[∑ a i x i > t], this will imply t > 4 √ k. Partition the a i 's into sets {G s } s , each having sum-of-squares in [ 1 256k , 1 128k ]; this is possible since (To be precise, at most one of the sets G s may have sum-of-squares less than 1 256k . As will be apparent below, this does not affect the proof, so we neglect that set.) For each s, consider the random variable X s = (∑ i∈G s a i x i ) 2 . It is easy to see that E[X s ] = ∑ i∈G s a 2 i , and that Recall that the classical Paley-Zygmund inequality asserts that for any nonnegative random variable Z with a finite second moment and for any α ∈ [0, 1], we have Applying this inequality to the random variable X s = (∑ i∈G s a i x i ) 2 , we get As 1/(16 √ k) ≤ ∑ i∈G s a 2 i by the construction of G s , we infer Pr ∑ i∈G s a i x i > 1/(32 √ k) > 1/10. Since the number of sets {G s } s is between 128k and 256k, we obtain as required.
Now we are ready to prove Claim 8.7.
Claim 8.7. Let f t be a halfspace that satisfies the assumptions of Theorem 8.2. For any x ∈ {−1, 1} n with f t (x) = 1, we have ∑ |S|=k a S x S ≥ 0.
Proof. Let f t satisfy the assumptions of the claim, and let x ∈ {−1, 1} n be such that f t (x) = 1.
a j x j , so that |b i | ≤ a i /t for each i and ∑ i b i = 1. Define as before e m = ∑ |S|=m b S and s m = ∑ i b m i . It is clear that for proving the claim, it is sufficient to prove e k ≥ 0. Clearly, ∀r ∈ N : |s 2+r | ≤ s 2 (max b i ) r . As ∀i : |b i | ≤ a i /t, this implies ∀r ∈ N : |s 2+r | ≤ s 2 max {(a i /t) r } .
Similarly to Claim 8.8, we shall prove by induction that e m−1 ≤ 2me m for each m ≤ k, and so in particular, e k ≥ 2 1−k /k! > 0, as required.
From the Newton-Girard formulas, we have This completes the proof.

Noise Resistance and Correlation with a Halfspace
Recall that a Boolean function f is called Fourier noise resistant if its first degree Fourier weight is within a constant factor of the maximal possible value, i.e., if W 1 ( f ) ≥ c 0 µ( f ) 2 log(1/µ( f )) for a fixed constant c 0 . In this section we prove Theorem 1.8 which asserts that for any Boolean function f , there exists a halfspace g such that Cov( f , g) ≥ Ω log(e/W 1 ( f )) . This implies that if f is Fourier noise resistant then it is strongly correlated with some halfspace g.
In addition, we show that in the special case where f is Fourier noise resistant, one can take the correlating halfspace g to be unbiased, and also there exists a strongly biased halfspace g whose correlation with f is 'surprisingly large'. Finally, we prove Proposition 1.9 which provides a 'probabilistic' notion of noise sensitivity for biased functions that implies strong correlation with a halfspace.

Proof of Theorem 1.8 and a Tightness Example
Let us recall the statement of the theorem. Proof. Let f be a Boolean function, let l(x) = ∑ f ({i})x i be the first Fourier level of f , and denote a l 2 = W 1 ( f ). Consider the family of biased halfpaces {g t (x) = 1{l(x) > t}} t∈R . The proof goes as follows: First, we show that the average correlation of f with a 'random' g t is 'not very small'. Then we use the Hoeffding inequality to assert that Cov( f , g t ) is very small for a large |t|, and deduce that there exists t such that Cov( f , g t ) is 'large', as asserted. Define We have By Hoeffding's inequality, for any t > 0 we have µ (g t ) ≤ Pr[l(x) > |t|] ≤ exp −t 2 /2a 2 , and therefore, |h(t)| ≤ exp −t 2 /2a 2 as well. Notice this also justifies the convergence of the above integrals. Let r = 6 log(2/a). Then A symmetric argument implies r −ra −∞ h(t)dt ≤ a 4 8 , and hence, a 2 − 2 a 4 8r ≤ ra −ra h(t)dt. Thus, there exists t ∈ (−ra, ra) with h(t) ≥ a 2 −a 4 /4r 2ra = Ω a √ log(2/a) , as desired. Theorem 1.8 is clearly tight (up to a constant factor) for any Fourier noise resistant function, as Cov( f , g) cannot exceed µ( f ). The following tightness example is of a different nature, being unbiased, monotone, and noise sensitive.
Example 9.1. Let t(x) be the classical tribes function defined by Ben-Or and Linial [4]. That is, we divide [n] into tribes T 1 , T 2 , . . . , T n/r , each of size r, and let t(x) = 1 ⇔ ∃ j : (x i = 1, ∀i ∈ T j ). The tribe size r is chosen such that E[t(x)] ≈ 1/2. (The size is r ≈ lg n − lg log n.) One can easily show that for any halfspace g(x), we have Cov(t, g) = o n (1).
Denote by η the maximal correlation of t(x) with a halfspace, so that η = o(1). Let h r (x) = 1{a · x > r} be a halfspace of measure η, and let f (x) = t(x) ∨ h r (x). The function f is monotone, we clearly have µ( f ) ≈ 1 2 , and by Theorem 1.1 we have On the other hand, as f = t + h r − t · h r , for any halfspace g we have where the last inequality holds since µ(t · h r ) ≤ µ(h r ) = η, and Cov(t, g) ≤ η by the definition of η. Therefore, the correlation of f with any halfspace is at most 3η = O( W 1 ( f )/ log(e/W 1 ( f ))), which means that Theorem 1.8 is sharp for f .

Remark.
A central feature of Theorem 1.8 is that it holds also for non-monotone functions. In the monotone case, Theorem 1.8 (together with the classical KKL theorem [24]) implies that for any unbiased monotone function f , there exists a halfspace g such that Cov( f , g) = Ω( log n/n). A stronger (and optimal) result of Ω(log n/ √ n) was obtained by O'Donnell and Wimmer [41] who used their result to obtain a provably optimal weak learning algorithm for the class of monotone functions.
The result of O'Donnell and Wimmer also shows that Theorem 1.8 is not tight for monotone unbiased functions with a 'very small' W 1 . Indeed, while the minimal possible value of W 1 is ν ∼ (log n) 2 /n (attained by the tribes function), the result of [41] shows that maximal correlation with a halfspace for a monotone biased function is always at least √ ν = log(n)/ √ n, and not √ ν/ log(e/ν), as we would have obtained if Theorem 1.8 was tight in that range.

A Stronger Correlation Theorem for Noise Resistant Functions
Unlike the classical result of Benjamini et al. [2] which states that any noise resistant function has a strong correlation with an unbiased halfspace, Theorem 1.8 does not guarantee that the correlating halfspace is unbiased. In the following proposition we show that in the special case where f is noise resistant, one may require the correlating halfspace to be unbiased, like in [2]. In the proof of the proposition we use the classical noise operator T ρ , which lies behind the notion of noise sensitivity. The noise operator is defined as where y is obtained from x by independently keeping each coordinate of x unchanged with probability ρ, and replacing it by a random value with probability 1 − ρ. It has a convenient representation in terms of the Fourier expansion of f : we have T ρ ( f ) = ∑ S ρ |S| f (S), and thus, by the Parseval identity, The method we use in the proof was introduced in [27].
Proof. Let f : {−1, 1} n → {0, 1} be a Fourier noise resistant function. Denote µ( f ) = ε, so that x i , and denote g 0 (x) = sgn(l(x)). We show that for an appropriate choice of ρ, the function f has a strong correlation with the 'noisy version' T ρ g 0 . As T ρ g 0 is a convex combination of unbiased halfspaces, this will imply that there exists an unbiased halfspace g 0 that strongly correlates with f . Let ρ be a parameter to be chosen below. Since E[T ρ g 0 ] = E[g 0 ] = 0, we have where the last inequality uses Cauchy-Schwarz. As l( where the first equality uses Parseval's identity and the last inequality employs the Khintchine-Kahane inequality. As f is Fourier noise resistant, we have l 2 = W 1 ( f ) ≥ c 0 ε 2 log(1/ε), and so combining (63) with (64) we get Using the level-k Inequality (Theorem 8.1 above) which asserts that ∀k ≤ 2 log(1/µ( f )) : we obtain E f · T ρ g 0 ≥ ρε · c 0 log(1/ε) Taking ρ = (1/2e) 2 c 0 / log(1/ε), and noting that w.l.o.g. we may assume c 0 ≤ 1, results in Finally, note that by the definition of the noise operator, the function T ρ g 0 is a convex combination of unbiased halfspaces of the form sgn (∑ n i=1 (−1) α i g 0 ({i})x i ), where ∀i : α i ∈ {0, 1}. Hence, there exists an unbiased halfspace g 0 such that Cov( f , g 0 ) = Ω(c 0 ε), as asserted.
Interestingly, one cannot guarantee that the linear form associated with the correlating halfspace is simply l = f =1 like in the unbiased case studied in [2], as can be seen in the following example.

Any Fourier Noise Resistant Function Correlates Well with a Biased Halfspace
We now present another proposition which shows that any Fourier noise resistant function correlates well with a strongly biased halfspace. This result is somewhat surprising, as biased functions correlate badly in general. Proof. Let s = 1 2 α log(1/ε), and consider the halfspace g s (x) = 1{l(x) > s} (with l 2 = 1). By Hoeffding's inequality, we have E[g s ] ≤ ε α/8 . Hence, in order to prove the proposition we have to show that Denote l = 1 f =1 2 f =1 . We clearly have Since it follows from (68) that Note that for any u ≥ s, Taking u = 2 log(1/ε) and combining the four previous inequalities, we obtain Finally, note that we have 1 u ε 2 ≤ 1 2 sε, as otherwise we have √ α log(1/ε) = us ≤ 2ε, and hence, α ≤ 4ε 2 , which contradicts the assumption. Therefore, (69) gives This proves (67), thus completing the proof of the proposition.

A Probabilistic Notion of Noise Resistance
We conclude this section with a 'probabilistic' notion of noise resistance that implies strong correlation with a halfspace. Recall that by the definition of [2], a function f is called noise resistant if S ρ ( f ) = Ω(1) for any constant ρ. If we want to generalize this definition to biased functions, the rate of the noise we consider must depend on the expectation of the function, as will be shown below. In order to find a natural rate of noise for biased functions, we use (once again) the relation of noise sensitivity to the Fourier expansion of the function.
Determining the 'right' rate of noise. Using the expansion and the level-k inequality (i.e., Theorem 8.1 above), we get where ε = µ( f ). By Stirling's approximation, n! ≈ √ 2πn(n/e) n , this implies S ρ ( f ) − ρ 2 log(1/ε) 1 − ρ ≤ ε 2 ∑ k (c ρ log(1/ε)) k k! ≤ ε 2 exp(c ρ log(1/ε)), where c is a universal constant. It follows that if ρ = o(1/ log(1/ε)) then S ρ ( f ) is very small for any function f . Hence, we consider noise rate of ρ = Θ(1/ log(1/ε)), for which S ρ ( f ) can be as large as µ( f ) 2 , and say that f is noise resistant if S ρ ( f ) = Ω(µ( f ) 2 ). We we prove Proposition 1.9 which asserts that this notion of noise resistance implies strong correlation with a halfspace. Proof. Recall that the quantitative version of the BKS noise sensitivity theorem [30] asserts that for any monotone f and for any k which satisfies W 1 ( f ) ≤ exp(−2(k − 1)), we have Let f be a function that satisfies the assumptions of the proposition. Denote ε = µ( f ) and let α satisfy W 1 ( f ) = ε α . We shall compute an upper bound for S ρ ( f ) when ρ = c/ log(1/ε), for a constant c to be specified below. Set T = log(1/ε)/2, and note that since W 1 ( f ) ≤ ε by the Poincaré inequality, we may apply (70) for all k ≤ T . Hence, we have where the last inequality follows by taking c to be a sufficiently small constant and the '100' in the denominator can be taken to be any constant (determined by c).
(This provides a biased version of the noise stability theorem of Peres for halfspaces [42]). In light of the previous results, one might wonder whether every monotone function f that satisfies (72) is well-correlated with some halfspace. This indeed holds for unbiased functions, as in this case, (72) implies that f is noise stable (according to the notation of [2]), and consequently, satisfies W 1 ( f ) = Ω(1), which in turn implies that f correlates well with a halfspace.
However, this does not generalize to the biased setting, as can be seen in the following example. Let x i, j be a variant of the tribes function, with a = 1/ε and b = 2 lg(1/ε). Clearly, µ( f ) ≈ ε. It can be shown that on the one hand, f satisfies (72), and on the other hand, f does not correlate well with any halfspace (i.e., Cov( f , g t ) = o(ε) for any halfspace g t ).