Product Space Models of Correlation: Between Noise Stability and Additive Combinatorics

There is a common theme to research questions in additive combinatorics and noise stability. Both study the following basic question: Let $P$ be a probability distribution over a space $\Omega^\ell$ with all $\ell$ marginals equal. Let $X^{(1)}, \ldots, X^{(\ell)}$, where $X^{(j)} = (X_1^{(j)}, \ldots, X_n^{(j)})$, be random vectors such that for every coordinate $i \in [n]$ the tuples $(X_i^{(1)}, \ldots, X_i^{(\ell)})$ are i.i.d.~according to $P$. A central question that is addressed in both areas is: Does there exist a function $c()$, independent of $n$, such that for every $f: \Omega^n \to [0, 1]$ with $E[f(X^{(1)})] = \mu>0$: \begin{align*} E \left[ \prod_{j=1}^\ell f(X^{(j)}) \right] \ge c(\mu)>0 \; ? \end{align*} Instances of this question include the Finite Field Model versions of Roth's and Szemerédi's theorems as well as Borell's result on the optimality of the noise stability of half-spaces. Our goal in this paper is to interpolate between noise stability theory and finite field additive combinatorics and to address the question above in greater generality than considered before. In particular, we settle the question for $\ell = 2$, and for $\ell>2$ when $P$ has bounded correlation $\rho(P)<1$. Under the same conditions we also characterize the {\em obstructions} to similar lower bounds in the case of $k$ different functions. Part of the novelty in our proof is the combination of analytic arguments from the theories of influences and hypercontraction with arguments from additive combinatorics.


Setup and same-set hitting
In this paper we analyze a general framework which includes many fundamental questions in both the theory of noise stability and in finite field models of additive combinatorics. We begin by formally defining this general setting. Let Ω be a finite set and assume we are given a probability distribution P over Ω^ℓ for some ℓ ≥ 2; we will call it an ℓ-step probability distribution over Ω.
Furthermore, assume we are given n ∈ N. We consider ℓ random vectors X^(1), ..., X^(ℓ), where X^(j) = (X^(j)_1, ..., X^(j)_n); for each coordinate i ∈ [n], the tuple X_i = (X^(1)_i, ..., X^(ℓ)_i) is sampled according to P, independently of the other coordinates i′ ≠ i (see Figure 1 for an overview of the notation).
In this paper we address the question: which distributions P are same-set hitting? We achieve a full characterization for ℓ = 2 and answer the question affirmatively for a large class of distributions with ℓ > 2.
The question of set hitting was studied extensively in additive combinatorics and in the theory of influences and noise stability. Perhaps the most well-studied case is that of random arithmetic progressions. Let Z be a finite additive group and ℓ ∈ N. Then, we can define a distribution P_{Z,ℓ} of random ℓ-step arithmetic progressions in Z. Specifically, for every x, r ∈ Z we set: P_{Z,ℓ}(x, x + r, x + 2r, ..., x + (ℓ − 1)r) := 1/|Z|².
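To make the definition concrete, here is a minimal Python sketch (our own illustration, not part of the original development) that draws from P_{Z,ℓ} for the cyclic group Z = Z_q:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_progression(q, ell):
    """One draw from P_{Z,ell} for Z = Z_q: pick x and r uniformly and
    return the ell-term progression (x, x+r, ..., x+(ell-1)r) mod q."""
    x, r = rng.integers(0, q, size=2)
    return (x + r * np.arange(ell)) % q

print(sample_progression(q=7, ell=3))  # a random 3-term progression in Z_7
```

Each of the ℓ marginals of this distribution is uniform on Z, matching the equal-marginals assumption above.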
As is well known, the case ℓ = 3 follows from the arguments of Roth [Rot53] applied to the finite field setup, while the general case follows from a long line of work, starting with Szemerédi's theorem [Sze75], its proof by Furstenberg using ergodic theory [Fur77], as well as the finite group and multi-dimensional versions, see, e.g., [Rot53, FK91, Gow01, Gre05].
It is natural to consider a generalization of the question where different functions are applied to different X^(j). This question was studied in the theories of Gaussian noise stability and hypercontraction, as we explain next.
Borell [Bor85] established the set hitting property in the Gaussian case where $(X_i, Y_i) \sim \mathcal{N}\left(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$ are i.i.d. and ρ ∈ (0, 1). In fact [Bor85] does much more: it finds the optimal δ in terms of µ and ρ in this case. (Note that in this case Ω is infinite.)
In earlier work [Bor82], Borell also proved some of the first reverse hypercontractive inequalities. These give a different proof that the Gaussian example above is set hitting, but also imply the same for the binary analog where (X_i, Y_i) ∈ {−1, 1}² satisfy E[X_i] = E[Y_i] = 0 and E[X_i Y_i] = ρ. See [MOR+06] for a discussion of this result and some of its implications.
The full classification of set hitting distributions can be deduced from a paper on reverse hypercontractivity by Mossel, Oleszkiewicz and Sen [MOS13]: Theorem 1.4 (follows from [MOS13]). A finite probability space P is set hitting if and only if it has full support, i.e., β(P) := min_{x ∈ Ω^ℓ} P(x) > 0. In many interesting settings, including the finite field models in additive combinatorics, the distribution P does not have full support. In these settings, the goal is to understand sufficient conditions on the functions which imply that (1) does hold, as we discuss next.

Obstructions in additive combinatorics
In general, much of the interest in additive combinatorics is in understanding what conditions on the functions f imply (1). In the setup of Roth's theorem [Rot53], i.e., arithmetic progressions of length 3, it is known that if the functions f_1, f_2, f_3 all have the property that every Fourier coefficient of f_i − E[f_i] is small, then (1) holds. More recently, Gowers showed that for longer arithmetic progressions, if the functions f_i have low Gowers norm then (1) holds, see, e.g., [Gre05].
In one of our main results (see Section 1.5.3 below) we show that in a fairly general setup (which does not include the additive combinatorics setup), the only obstruction to (1) is that one of the functions has a large Fourier coefficient.

Basic example
At this point we would like to introduce the simplest example that is not covered by either the theory of influences or techniques from additive combinatorics. Let S ⊆ {0, 1, 2}^n be a non-empty set of density µ = |S|/3^n. We pick a random vector X = (X_1, ..., X_n) uniformly from {0, 1, 2}^n, and then sample another vector Y = (Y_1, ..., Y_n) such that for each i independently, the coordinate Y_i is picked uniformly in {X_i, X_i + 1 mod 3}. Our goal is to show that: $$\Pr[X \in S \wedge Y \in S] \ge c(\mu) > 0.$$ In other words, we want to bound the probability away from 0 by an expression which depends only on µ and not on n. Similarly, given sets S and T of density at least µ, we want to find under what conditions the probability Pr[X ∈ S ∧ Y ∈ T] can be lower bounded effectively. We note that the support of the distribution on {0, 1, 2}² is not full and that the distribution is not of arithmetic nature.
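The claimed bound can be probed empirically before any theory. The following Python sketch (our own illustration; the threshold set anticipates S_thr from the proof sketch in Section 4) estimates Pr[X ∈ S ∧ Y ∈ S] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(n):
    """X uniform in {0,1,2}^n; independently for each i, Y_i is uniform
    in {X_i, X_i + 1 mod 3}."""
    x = rng.integers(0, 3, size=n)
    y = (x + rng.integers(0, 2, size=n)) % 3
    return x, y

def estimate_same_set_hitting(in_S, n, trials=20_000):
    """Monte Carlo estimate of Pr[X in S and Y in S]."""
    hits = sum(in_S(x) and in_S(y)
               for x, y in (sample_pair(n) for _ in range(trials)))
    return hits / trials

n = 30
in_S = lambda x: np.count_nonzero(x == 0) >= n / 3  # threshold set, density ~1/2
print(estimate_same_set_hitting(in_S, n))
```

The estimate stays bounded away from 0 as n grows, which is exactly what the same-set hitting property asserts.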

Same set hitting for two steps
In the case ℓ = 2 we establish the following theorem: Theorem 1.5 (cf. Theorem 3.1). A two-step probability distribution with equal marginals P is same-set hitting if and only if α(P) := min_{x∈Ω} P(x, x) > 0.
Of course, if β(P) > 0, then Theorem 1.5 follows from Theorem 1.4. However, we are not aware of any previous work in the case β(P) = 0, i.e., when the distribution is same-set hitting but not set hitting, in particular for the probability space from Section 1.4.

Same set hitting for more than two steps
In the general case of an ℓ-step distribution with equal marginals, it is still clear that α(P) > 0 is necessary. However, it remains open whether it is sufficient.
We provide the following partial results. Firstly, by a simple inductive argument based on Theorem 3.1, we show that multi-step probability spaces induced by Markov chains are same-set hitting (cf. Section 8). Secondly, our main result, Theorem 3.2, shows that α(P) > 0 is sufficient whenever ρ(P) < 1.
We are not aware of any general results in case ρ(P) = 1. In particular, let P be a three-step distribution over Ω = {0, 1, 2} such that X is uniform over {000, 111, 222, 012, 120, 201}. To the best of our knowledge, it is an open question whether this distribution P is same-set hitting.
One might conjecture that α(P) > 0 is the sole sufficient condition for same-set hitting. Unfortunately, the techniques used to prove Theorem 1.2 do not seem to extend easily to less symmetric spaces. This suggests that proving the conjecture in full in the ρ(P) = 1 case might be a difficult undertaking.

Set hitting for functions with no large Fourier coefficients
The methods developed here also allow us to obtain lower bounds on the probability of hitting multiple sets. In fact, we show that if ρ(P) < 1, then such lower bounds exist in terms of ρ, the measures of the sets and the largest non-empty Fourier coefficient. Theorem 1.7 (Informal, cf. Theorem 3.3). Let P be a probability distribution with ρ(P) < 1. Then, P is set-hitting for functions f^(1), ..., f^(ℓ) : Ω^n → [0, 1] that have both: • no large non-empty Fourier coefficients, i.e., max_{σ≠0} |f̂^(j)(σ)| ≤ o(1), and • noticeable measure, i.e., E[f^(j)(X^(j))] ≥ Ω(1).

Other related work
The case of ρ < 1 has also been studied in the context of extremal combinatorics and hardness of approximation. In particular, Mossel [Mos10] uses the invariance principle to prove that if ρ(P) < 1, then P is set hitting for low-influence functions. We use this result to establish Theorem 1.6. Additionally, Theorem 1.7 can be seen as a strengthening of [Mos10].
Furthermore, Austrin and Mossel [AM13] establish a result equivalent to Theorem 1.7 assuming, in addition to ρ(P) < 1, that P is pairwise independent (they also prove results for the case ρ(P) = 1 with pairwise independence, but these involve only bounded degree functions).
Finally we note that another relevant paper in the case of ℓ = 2 and symmetric P is by Dinur, Friedgut and Regev [DFR08], who give a characterization of non-hitting sets. However, due to a different framework, their results are not directly comparable to ours.
Our work is related to problems and results in inapproximability in theoretical computer science. For example, our theorem is related to the proof of hardness for rainbow colorings of hypergraphs by Guruswami and Lee [GL15]. In particular, it is connected to their Theorem 4.3 and partially answers their Questions C.4 and C.6.

Proof combining ideas from additive combinatorics and the theory of influences
Interestingly, the proof of our results interpolates between additive combinatorics and the theory of influences. Results of [Mos10] imply that if a collection of functions have low influences, then they are same-set hitting. In the proof of Theorem 3.2 we apply a variant of a density increment argument to reduce to this case. First we apply the standard density increment argument so that we may assume without loss of generality that conditioning on a small number of coordinates does not change the measure of the set by much. Then we show, under this assumption, by applying a variant of the density increment argument, that we can additionally assume without loss of generality that all influences are small.

Outline of the paper
The rest of the paper is organised as follows: the notation is introduced in Section 2, Section 3 contains full statements of our theorems and Section 4 sketches the proof of our main theorem. The full proof of the multi-step theorem follows in Section 5. The proof of the two-step theorem is in Section 6 and the proof for functions with small Fourier coefficients in Section 7. A theorem for Markov chains is introduced in Section 8 and better bounds for symmetric spaces in Section 9. Finally, the modified proof of the low-influence theorem from [Mos10] is presented in the appendix. We note that an extended abstract with a preliminary version of our results appeared in [HHM16].

Notation
We will now introduce our setting and notation. We refer the reader to Figure 1 for an overview.
We always assume that we have n independent coordinates. In each coordinate i we pick ℓ values X^(j)_i for j ∈ [ℓ] = {1, ..., ℓ} at random using some distribution. Each value X^(j)_i is chosen from the same fixed set Ω, and the distribution of the tuple X_i = (X^(1)_i, ..., X^(ℓ)_i) of values from Ω^ℓ is given by a distribution P. This gives us values X^(j)_i for i ∈ {1, ..., n} and j ∈ {1, ..., ℓ}. Thus, we have ℓ vectors X^(1), ..., X^(ℓ), where X^(j) = (X^(j)_1, ..., X^(j)_n) represents the j-th step of the random process. In case ℓ = 2, we might call our two vectors X and Y instead.
For reasons outlined in Section 3.4.2 we assume that all of the X^(j)_i have the same marginal distribution, which we call π. We assume that Ω is the support of π.
Even though it is not necessary, for clarity of the presentation we assume that each tuple X_i = (X^(1)_i, ..., X^(ℓ)_i) has the same distribution P. We consistently use the index i to index over the coordinates (from [n]) and j to index over the steps (from [ℓ]).
We sometimes call the joint distribution 𝐏 := P^⊗n of X a tensorized, multi-step probability distribution, as opposed to the tensorized, single-step distribution 𝛑 := π^⊗n and the single-coordinate, multi-step distribution P.
Furthermore, we extend the index notation to subsets of indices or steps. For example, for S ⊆ [ℓ] we define X^(S) to be the collection of random variables {X^(j) : j ∈ S}.
One should think of ℓ and |Ω| as constants and of n as large. We aim to get bounds which are independent of n.

Correlation
In case ℓ > 2, the bound we obtain will depend on the correlation of the distribution P. This concept was used before in [Mos10].
Definition 2.1. Let P be a single-coordinate distribution and let S, T ⊆ [ℓ] be disjoint. We define the correlation: $$\rho(P; S, T) := \max\big\{ \mathrm{Cov}[f(X^{(S)}), g(X^{(T)})] \,:\, \mathrm{Var}[f(X^{(S)})] = \mathrm{Var}[g(X^{(T)})] = 1 \big\},$$ where the maximum is over real-valued functions f and g. We also set ρ(P) := max_{j∈[ℓ]} ρ(P; {j}, [ℓ] \ {j}).
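For ℓ = 2, a standard fact identifies ρ(P; {1}, {2}) with the second largest singular value of the matrix Q(x, y) := P(x, y)/√(π(x) π(y)) (the largest singular value is always 1). The following numerical sketch (our own illustration, assuming numpy) computes ρ for the basic example of Section 1.4:

```python
import numpy as np

# Joint distribution of the basic example: P(x, y) = 1/6 for y in {x, x+1 mod 3}.
P = np.zeros((3, 3))
for x in range(3):
    P[x, x] = P[x, (x + 1) % 3] = 1 / 6

pi_x = P.sum(axis=1)  # marginal of X (uniform)
pi_y = P.sum(axis=0)  # marginal of Y (uniform)

Q = P / np.sqrt(np.outer(pi_x, pi_y))
print(np.linalg.svd(Q, compute_uv=False))  # [1.0, 0.5, 0.5]  =>  rho(P) = 0.5
```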

Influence
A crucial notion in the proof of Theorem 1.6 is the influence of a function. It expresses the average variance of a function, given that all but one of its n inputs have been fixed to random values: Definition 2.2. Let X be a random vector over the alphabet Ω^n, let f : Ω^n → R be a function and i ∈ [n]. The influence of f on the i-th coordinate is: $$\mathrm{Inf}_i(f(X)) := \mathrm{E}\big[\mathrm{Var}[f(X) \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n]\big].$$ Note that the influence depends both on the function f and on the distribution of the vector X.
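By definition, the influence can be computed by brute force for small n; a Python sketch (our own illustration, for the uniform marginal on {0, ..., q−1}):

```python
import itertools
import numpy as np

def influence(f, i, n, q=3):
    """Inf_i(f(X)) for X uniform on {0,...,q-1}^n: the average, over the other
    coordinates, of the variance of f as the i-th coordinate is resampled."""
    rest = list(itertools.product(range(q), repeat=n - 1))
    total = sum(np.var([f(r[:i] + (a,) + r[i:]) for a in range(q)]) for r in rest)
    return total / len(rest)

# The dictator f(x) = 1[x_0 = 0] has Inf_0 = Var(Bernoulli(1/3)) = 2/9, Inf_1 = 0.
f = lambda x: 1.0 if x[0] == 0 else 0.0
print(influence(f, 0, n=3), influence(f, 1, n=3))  # 0.222..., 0.0
```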

Our Results
Here we give precise statements of our results presented in the introduction.

The case of ℓ = 2
Theorem 3.1. Let Ω be a finite set and P a probability distribution over Ω 2 with equal marginals π. Let pairs (X i , Y i ) be i.i.d. according to P for i ∈ {1, . . . , n}.
Then, for every f : Ω^n → [0, 1] with E[f(X)] = E[f(Y)] = µ > 0: $$\mathrm{E}[f(X) f(Y)] \ge c(P, \mu),$$ where the function c() is positive whenever α(P) > 0.
We remark that the bound in Theorem 3.1 does not depend on ρ(P) in any way. This is in contrast to the case ℓ > 2. It is possible to obtain an explicit polynomial bound c() for symmetric two-step spaces (see Section 9).
To prove Theorem 3.1 we make a convex decomposition argument and then apply the multi-step Theorem 3.2 (see Section 6). For completeness, we provide a proof of Theorem 1.5 assuming Theorem 3.1.
Proof of Theorem 1.5. The "if" part follows from Theorem 3.1. The "only if" can be seen by taking f to be an appropriate dictator.

The general case
Theorem 3.2. Let Ω be a finite set and P a distribution over Ω^ℓ in which all marginals are equal. Let the tuples X_i be i.i.d. according to P for i ∈ [n]. Then, for every f : Ω^n → [0, 1] with E[f(X^(j))] ≥ µ > 0 for all j ∈ [ℓ]: $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f(X^{(j)})\Big] \ge c(P, \mu),$$ where the function c() is positive whenever α(P) > 0 and ρ(P) < 1. Furthermore, there exists some D(P) > 0 (more precisely, D depends on α, ρ and ℓ) such that if µ ∈ (0, 0.99], one can take the triply exponential bound $c(\mu) = \exp(-\exp(\exp(D \cdot \mu^{-D})))$. Note that this bound does depend on ρ(P). We also obtain a bound that does not depend on ρ(P) for multi-step probability spaces generated by Markov chains (see Section 8).

Equal distributions: unnecessary
In Theorems 3.1, 3.2 and 3.3 we assumed that the tuples X_1, ..., X_n are identically distributed. One might think that this assumption is essential. This is not the case. Instead, we made this assumption for simplicity of notation and presentation. If one is interested in statements which are valid when coordinate i is distributed according to P_i, one simply needs to assume that there are α > 0 and ρ < 1 such that α(P_i) ≥ α and ρ(P_i) ≤ ρ for every i.
On the other hand, in the case β(P) = 0 the assumption of equal marginals cannot be dropped: we demonstrate an example which shows that E[∏_{j=1}^ℓ f(X^(j))] can be exponentially small in n. For concreteness, we set ℓ := 2 and Ω := {0, 1} and consider P which picks uniformly among {00, 01, 11}. We then set S_1 := {x : x_1 = 1 ∧ wt(x) ≤ n/2} and S_2 := {x : x_1 = 0 ∧ wt(x) > n/2}, where wt(x) is the Hamming weight of x, i.e., the number of ones in x. For large enough n, a concentration bound implies that Pr[X^(1) ∈ S_1] > 1/3 − 0.01 and Pr[X^(2) ∈ S_2] > 1/3 − 0.01. Hence, if we set f to be the indicator function of S := S_1 ∪ S_2, the assumption of Theorem 3.2 holds. However, because of the first coordinate we have Pr[X^(1) ∈ S ∧ X^(2) ∈ S] ≤ Pr[X^(1) ∈ S_2 ∨ X^(2) ∈ S_1], and the right-hand side is easily seen to be exponentially small.
It is not difficult to extend this example to any distribution with β(P) = 0 that does not have equal marginals.

Proof Sketch
In this section we briefly outline the proof of Theorem 3.2. For simplicity, we assume that the probability space is the one from Section 1.4, i.e., the (X_i, Y_i) are distributed uniformly in {00, 11, 22, 01, 12, 20}. Additionally, we assume that we are given a set S ⊆ {0, 1, 2}^n with µ(S) = |S|/3^n > 0, so that we want a bound of the form $$\Pr[X \in S \wedge Y \in S] \ge c(\mu) > 0.$$ The proof consists of three steps. Intuitively, in the first step we deal with dictator sets, e.g., S_dict = {x : x_1 = 0}, in the second step with linear sets, e.g., S_lin = {x : Σ_{i=1}^n x_i ≡ 0 (mod 3)}, and in the third step with threshold sets, e.g., S_thr = {x : |{i : x_i = 0}| ≥ n/3}.

Step 1 -making a set resilient
We call a set S resilient if Pr[X ∈ S] does not change by more than a (small) multiplicative constant factor whenever we condition on an event (X_{i_1} = x_{i_1}, ..., X_{i_s} = x_{i_s}) involving a constant number s of coordinates.
In particular, S dict is not resilient (because conditioning on x 1 = 0 increases the measure of the set to 1), while S lin and S thr are.
If a set is not resilient, then, using P(x, x) = 1/6 > 0 for every x ∈ Ω, one can find an event (X_{i_1} = Y_{i_1} = x_{i_1}, ..., X_{i_s} = Y_{i_s} = x_{i_s}) of positive probability such that conditioning on it increases the measure of S by a constant factor. Since each such conditioning increases the measure of the set S by a constant factor, S must become resilient after a constant number of these iterations. Furthermore, each conditioning induces only a constant factor loss in Pr[X ∈ S ∧ Y ∈ S].

Step 2 -eliminating high influences
In this step, assuming that S is resilient, we condition on a constant number of coordinates to transform it into two sets S ′ and T ′ such that: • Both of them have low influences on all coordinates.
• Both of them are supersets of S (after the conditioning).
The first property allows us to apply low-influence set hitting from [Mos10] to S ′ and T ′ . The second one, together with the resilience of S, ensures that µ(S ′ ), µ(T ′ ) ≥ (1 − ǫ)µ(S).
In fact, it is more convenient to assume that we are initially given two resilient sets S and T. Assume that some coordinate i has high influence for S or T. One can then find a value z ∈ Ω such that the quantity µ(S_z) + µ(T*_z), where S_z denotes the restriction of S obtained by fixing x_i := z and T*_z := T_z ∪ T_{z+1 (mod 3)}, is strictly greater than the sum µ(S) + µ(T); this is Equation (9). We choose to delete the coordinate i and replace S with S′ := S_z and T with T′ := T*_z. Equation (9) implies that after a constant number of such operations, neither S nor T has any remaining high-influence coordinates.
Crucially, with respect to same-set hitting our set replacement is essentially equivalent to conditioning on X_i = z and (Y_i = z ∨ Y_i = z + 1 (mod 3)). Therefore, each operation induces only a constant factor loss in Pr[X ∈ S ∧ Y ∈ T].

Step 3 -applying low-influence theorem from [Mos10]
Once we are left with two low-influence, somewhat-large sets S and T, we obtain Pr[X ∈ S ∧ Y ∈ T] ≥ c(µ) > 0 by a straightforward application of a slightly modified version of Theorem 1.14 from [Mos10]. The theorem states that ρ(P) < 1 implies that the distribution P is set hitting for low-influence functions: Theorem 4.1. Let X be a random vector distributed according to (Ω, P) such that P has equal marginals, ρ(P) ≤ ρ < 1 and min_{x∈Ω} π(x) ≥ α > 0. Then, for all ε > 0, there exists τ := τ(ε, ρ, α, ℓ) > 0 such that if the functions f^(1), ..., f^(ℓ) : Ω^n → [0, 1] satisfy Inf_i(f^(j)(X^(j))) ≤ τ for all i ∈ [n] and j ∈ [ℓ], then $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f^{(j)}(X^{(j)})\Big] \ge \Big(\prod_{j=1}^{\ell} \mathrm{E}[f^{(j)}(X^{(j)})]\Big)^{\ell/(1-\rho^2)} - \epsilon.$$

The case ρ = 1 : open question
Theorem 3.2 requires that ρ < 1 in order to give a meaningful bound. It is unclear whether this is an artifact of our proof or if it is necessary. In particular, consider the three step distribution P which picks a uniform triple from {000, 111, 222, 012, 120, 201}. One easily checks that ρ(P) = 1 and that all marginals are uniform. We do not know if this distribution is same-set hitting.
However, the method of our proof breaks down. We illustrate the reason in the following lemma.
Lemma 4.2. For every n greater than some n_0 there exist three sets S^(1), S^(2), S^(3) ⊆ {0, 1, 2}^n such that for the distribution P as described above we have: • Each of the three sets has measure bounded below by a positive constant. • The characteristic functions $\mathbb{1}_{S^{(j)}}$ of the three sets all have influences o(1). • Pr[X^(1) ∈ S^(1) ∧ X^(2) ∈ S^(2) ∧ X^(3) ∈ S^(3)] = 0. While the lemma does not give information about whether P is same-set hitting, it shows that our proof fails (since the analogue of Theorem 4.1 fails). Indeed, take S^(1) := {x : |{i : x_i = 2}| > n/3}, S^(2) := {x : |{i : x_i = 1}| > n/3} and S^(3) := {x : |{i : x_i = 0}| > n/3}. Whenever we pick X^(1), X^(2), X^(3), the number of twos in X^(1) plus the number of ones in X^(2) plus the number of zeros in X^(3) always equals n (there is a contribution of one from each coordinate), so the three events cannot hold simultaneously. All three properties are now easy to check.
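The counting identity behind the lemma is easy to confirm by simulation (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
TUPLES = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [0, 1, 2], [1, 2, 0], [2, 0, 1]])

n = 20
X1, X2, X3 = TUPLES[rng.integers(0, 6, size=n)].T  # one uniform tuple per coordinate

# #2s in X1 + #1s in X2 + #0s in X3 == n always (one contribution per coordinate),
# so the three events "count > n/3" can never hold simultaneously.
print((X1 == 2).sum() + (X2 == 1).sum() + (X3 == 0).sum() == n)  # True
```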

Proof for General ℓ and ρ(P) < 1
The goal of this section is to prove our second main result, which we restate here for convenience.

Theorem 3.2. Let Ω be a finite set and P a distribution over Ω^ℓ in which all marginals are equal. Let the tuples X_i = (X^(1)_i, ..., X^(ℓ)_i) be i.i.d. according to P for i ∈ [n]. Then, for every f : Ω^n → [0, 1] with E[f(X^(j))] ≥ µ > 0 for all j ∈ [ℓ], the bound (5) holds, with c() positive whenever α(P) > 0 and ρ(P) < 1.

[Figure 1: overview of the notation. The tuples X_i are i.i.d. according to P; each of the ℓ marginals of P is π; each vector X^(j) is distributed (the vectors being mutually dependent) according to 𝛑 := π^n; each X^(j)_i is distributed according to π; the overall distribution of X is 𝐏.]

Properties of the correlation
Recall Definition 2.1. We now give an alternative characterization of ρ(P, {j}, [ℓ] \ {j}) which will be useful later. For this, we first define a certain random process and an associated Markov chain.
Definition 5.1. Fix j ∈ [ℓ]. A double sample (X^{\j}, Y, Z) from P on step j is obtained as follows: • X^{\j} := X^{([ℓ]\{j})} is sampled according to the corresponding marginal of P. • Assuming that X^{\j} = x^{\j}, the random variables Y and Z are then sampled independently of each other according to the j-th step of P conditioned on X^{\j} = x^{\j}.
Sometimes we will omit X \j from the notation and refer as double sample to (Y, Z) alone.
An equivalent interpretation of a double sample is that after sampling (X^{\j}, Y) according to P we "forget" about Y and sample Z again from the same distribution (keeping the same value of X^{\j}). Therefore, both (X^{\j}, Y) and (X^{\j}, Z) are distributed according to P. Defining K(y, z) := Pr[Z = z | Y = y] and using that Y and Z are exchangeable, we see that $$\pi(y) K(y, z) = \Pr[Y = y \wedge Z = z] = \pi(z) K(z, y),$$ which means that K is the kernel of a Markov chain that is reversible with respect to π (see e.g., [LPW08, Section 1.6]). Thus, K has an orthonormal eigenbasis with eigenvalues 1 = λ_1(K) ≥ λ_2(K) ≥ ⋯ ≥ λ_{|Ω|}(K) ≥ −1 (e.g., [LPW08, Lemma 12.2]). We will say that K is the Markov kernel induced by the double sample (Y, Z). A standard fact from Markov chain theory expresses λ_2(K) in terms of the covariance of functions f ∈ L²(Ω, π): Lemma 5.2 (Lemma 13.12 in [LPW08]). Let Y, Z be two consecutive steps of a reversible Markov chain with kernel K such that both Y and Z are distributed according to a stationary distribution π of K. Then, $$\lambda_2(K) = \max\big\{ \mathrm{E}[f(Y) f(Z)] : \mathrm{E}[f(Y)] = 0,\ \mathrm{E}[f(Y)^2] = 1 \big\}.$$ Lemma 5.3. Let P be a single-coordinate distribution and let (X^{\j}, Y, Z) be a double sample from P that induces a Markov kernel K. Then, $$\rho(P, \{j\}, [\ell] \setminus \{j\}) = \sqrt{\lambda_2(K)}.$$ Proof. For readability, let us write X instead of X^{\j}.
Consider first two functions f and g as in Definition 2.1, and assume without loss of generality that they have mean zero and unit variance and that there exists a choice of f and g that achieves equality in (15).
Now, by Cauchy–Schwarz, (16) and Lemma 5.2 we see that the covariance of f and g is at most √(λ_2(K)). The equality is obtained for f that maximizes the right-hand side of (14) and g := c · h for some c > 0.
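Lemma 5.3 can be checked numerically on the basic example of Section 1.4 (our own illustration): we build the kernel K of the double sample on the second step and compare √λ₂(K) with the value ρ(P) = 1/2 computed after Definition 2.1:

```python
import numpy as np

cond = np.zeros((3, 3))  # cond[x, y] = Pr[Y = y | X = x] = 1/2 for y in {x, x+1}
for x in range(3):
    cond[x, x] = cond[x, (x + 1) % 3] = 1 / 2

pi = np.full(3, 1 / 3)
joint = pi[:, None] * cond           # joint[x, y] = Pr[X = x, Y = y]
post = joint / joint.sum(axis=0)     # post[x, y] = Pr[X = x | Y = y]
K = post.T @ cond                    # K[y, z] = sum_x Pr[x | y] Pr[z | x]

lam = np.sort(np.linalg.eigvals(K).real)[::-1]
print(lam)              # [1.0, 0.25, 0.25]
print(np.sqrt(lam[1]))  # 0.5, matching rho(P)
```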
For later use, we record the following consequence of Lemma 5.3.

Reduction to the resilient case
In this section, we will prove that we can assume that the function f is resilient in the following sense: whenever we fix a constant number of inputs to some value, the expected value of f remains roughly the same. The intuitive reason for this is simple: if there is some way to fix the coordinates which changes the expected value of f , we can fix these coordinates such that the expected value increases, which only makes our task easier (and can be done only a constant number of times).
We first make the concept of "fixing" a subset of the coordinates formal.
Definition 5.5. A restriction is a vector R = (r_1, ..., r_n) with r_i ∈ Ω ∪ {⋆}. The coordinates with r_i = ⋆ are unrestricted, the coordinates where r_i ∈ Ω are restricted. The size of a restriction is the number of restricted coordinates.
A restriction R operates on a function f : Ω^n → R by fixing the restricted coordinates: (Rf)(x) := f(y), where y_i := r_i if r_i ∈ Ω and y_i := x_i if r_i = ⋆. Next, we define what it means for a function to be resilient: restrictions do not change the expectation too much.
Definition 5.6. Let X be a random vector distributed according to a (single-step) distribution (Ω^n, 𝛑). A function f : Ω^n → [0, 1] is ε-resilient up to size k if for every restriction R of size at most k we have that $$(1-\epsilon)\,\mathrm{E}[f(X)] \le \mathrm{E}[(Rf)(X)] \le (1+\epsilon)\,\mathrm{E}[f(X)].$$ The function is upper resilient if the expectation cannot increase too much.
Definition 5.7. Let X be a random vector distributed according to a distribution (Ω^n, 𝛑). A function f : Ω^n → [0, 1] is ε-upper-resilient up to size k if for every restriction R of size at most k we have that E[(Rf)(X)] ≤ (1 + ε) E[f(X)].
Resilience and upper resilience are equivalent up to a multiplicative factor which depends only on k and the smallest probability in the marginal distribution α(π). Intuitively, the reason is that if there is some restriction which decreases the 1-norm, then some other restriction on the same coordinates must increase the 1-norm somewhat. Lemma 5.8. If f is ε-upper-resilient up to size k, then f is ε′-resilient up to size k with ε′ := ε(α(π)^{−k} − 1).
Proof. Fix a subset S ⊆ [n] of the coordinates of size |S| ≤ k. We consider a random variable R whose values are restrictions with restricted coordinates being exactly S. The elements r_i ∈ Ω for i ∈ S are picked according to the distribution π. We let p(R′) be the probability that a certain restriction R′ is picked, and get $$\sum_{R'} p(R')\, \mathrm{E}[(R'f)(X)] = \mathrm{E}[f(X)],$$ where we sum over all restrictions R′ that restrict exactly the coordinates in S. Let now R* be one of the possible choices for R. Then, every term with R′ ≠ R* is at most (1 + ε) E[f(X)] by upper resilience, and hence: $$p(R^*)\, \mathrm{E}[(R^*f)(X)] \ge \mathrm{E}[f(X)] - (1 - p(R^*)) (1+\epsilon)\, \mathrm{E}[f(X)].$$ Since p(R*) ≥ α(π)^k we get the bound for the restriction R*, which was chosen arbitrarily.
Lemma 5.9. Let X be a random vector distributed according to a distribution with equal marginals (Ω, P) and f : Ω^n → [0, 1] with E[f(X^(j))] = µ > 0 for all j ∈ [ℓ]. Then, there exists a restriction R such that g := (Rf) is ε-resilient up to size k and $$\mathrm{E}[g(X^{(j)})] \ge \mu \quad (20), \qquad \mathrm{E}\Big[\prod_{j=1}^{\ell} f(X^{(j)})\Big] \ge c \cdot \mathrm{E}\Big[\prod_{j=1}^{\ell} g(X^{(j)})\Big] \quad (21),$$ where c := exp(−2 ln(1/µ) · k ln(1/α(P)) / ε′) and ε′ = ε′(ε, k, α(P)) > 0 is as in the proof. In particular, c depends only on ε, k, α(P) and µ (requiring ε, α(P), µ > 0).
Proof. If f is not ε-resilient up to size k, then by Lemma 5.8 there exists a restriction R of size at most k which increases the expectation of f by a multiplicative factor of at least (1 + ε′), for some ε′ > 0 depending only on ε, k and α(π). We repeat this, replacing f with (Rf), until there is no such restriction.
Since the expectation of f only increases, we get (20). Finally, once the process stops, the resulting function is ǫ-resilient due to Lemma 5.8 (note that α(π) ≥ α).
It remains to argue that (21) holds for the resulting function. Note first that the expectation cannot exceed 1, and hence the process will be repeated at most p := ln(1/µ)/ln(1 + ε′) ≤ 2 ln(1/µ)/ε′ times. Therefore, the final restriction R obtained after at most p iterations of the process above is of size at most pk.
Define g := (Rf ) and let E be the event that all strings X (1) , . . . , X (ℓ) agree with the restriction R in its restricted coordinates. We will use ½(E) to denote the function which is 1 if event E happens and 0 otherwise. We see that Finally,

Reduction to the low-influence case
We next show that if f is resilient, we can also assume that it has only low influences. However, this part of the proof actually produces a collection of functions g (1) , . . . , g (ℓ) such that each of them has small influences: it operates differently on each function. In turn, it is more convenient to do this part of the proof also starting from a collection f (1) , . . . , f (ℓ) , as long as all of them are sufficiently resilient.
As in the previous section, we use restrictions. Here, however, we are only interested in restrictions of size one. Consequently, we write R[i, a] to denote the restriction R = (r_1, ..., r_n) with r_i = a and r_{i′} = ⋆ for i′ ≠ i.
Furthermore, we require a new operator.
We define the operator M[i, y, z] as the pointwise maximum of two restrictions: $$(M[i, y, z]f)(x) := \max\big\{ (R[i, y]f)(x),\ (R[i, z]f)(x) \big\}.$$ First, note that E[(M[i, y, z]f)(X)] ≥ E[f(X)], and that if the influence Inf_i(f(X)) is large, then E[(M[i, y, z]f)(X)] ≥ E[f(X)] + c for some y, z ∈ Ω and c > 0. This implies that we can use this operator to increase the expectation of a function unless all of its influences are small. We will prove this property later.
Second, fix a step j* ∈ [ℓ] and assume that for some values x^{\j*} ∈ Ω^{ℓ−1} and y, z ∈ Ω, the probabilities Pr[X_i^{\j*} = x^{\j*} ∧ X_i^{(j*)} = y] and Pr[X_i^{\j*} = x^{\j*} ∧ X_i^{(j*)} = z] are "somewhat large" (larger than some constant).
We imagine now that X_i^{\j*} = x^{\j*} and that we have also picked all values X_{i′} for i′ ≠ i. With constant probability, X_i^{(j*)} is picked among y and z such that it maximizes f^(j*). Since this happens with constant probability, we conclude the following: suppose we replace f^(j*) with M[i, y, z]f^(j*) and each f^(j) for j ≠ j* with R[i, x^(j)]f^(j); then E[∏_j f^(j)(X^(j))] decreases by at most a constant factor. This second point is formalized in the following lemma: Lemma 5.11. Suppose that: $$\Pr[X_i^{\setminus j^*} = x^{\setminus j^*} \wedge X_i^{(j^*)} = y] \ge \beta \quad (22), \qquad \Pr[X_i^{\setminus j^*} = x^{\setminus j^*} \wedge X_i^{(j^*)} = z] \ge \beta \quad (23),$$ and let g^(j*) := M[i, y, z]f^(j*) and g^(j) := R[i, x^(j)]f^(j) for j ≠ j*. Then: $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f^{(j)}(X^{(j)})\Big] \ge \beta \cdot \mathrm{E}\Big[\prod_{j=1}^{\ell} g^{(j)}(X^{(j)})\Big].$$ Proof. We first define a random variable A, which is the value among y and z which X_i^{(j*)} needs to take in order to maximize f^(j*). Formally, A is a function of X_{\i} := (X_1, ..., X_{i−1}, X_{i+1}, ..., X_n). Consider now the event E which occurs if X_i = (x^{\j*}, A). We get $$\mathrm{E}\Big[\prod_j f^{(j)}(X^{(j)})\Big] \ge \mathrm{E}\Big[\mathbb{1}(E) \prod_j f^{(j)}(X^{(j)})\Big] = \mathrm{E}\Big[\mathbb{1}(E) \prod_j g^{(j)}(X^{(j)})\Big] = \mathrm{E}\Big[\Pr[E \mid X_{\setminus i}] \prod_j g^{(j)}(X^{(j)})\Big] \ge \beta\, \mathrm{E}\Big[\prod_j g^{(j)}(X^{(j)})\Big].$$ The first equality follows because if the event E happens, then the functions f^(j)(X^(j)) and g^(j)(X^(j)) are equal. The second equality uses that conditioned on X_{\i} the functions g^(j)(X^(j)) are constant. Finally, the last inequality follows because by (22) and (23), for every choice of X_{\i} = (X_1, ..., X_{i−1}, X_{i+1}, ..., X_n) the event E has probability at least β.
The obvious idea for the next step would be to find values x^{\j*}, y, z for which the expectation of M[i, y, z]f^(j*) exceeds that of f^(j*) by a constant, and then apply Lemma 5.11. Unfortunately, there is a problem with this strategy. To replace the function f^(j*) with M[i, y, z]f^(j*), Lemma 5.11 also replaces f^(j) with R[i, x^(j)]f^(j) for j ≠ j* (and this is required for the proof to work). Unfortunately, it is possible that these restrictions decrease the expectations of the remaining functions. We remark that we cannot use that f^(j) is resilient here: while f^(j) is resilient the first time we condition, the functions obtained in the subsequent steps are not resilient in general, so later steps will not have the guarantee.
Our solution is to pick the values (X^{\j*}, Y, Z) at random, as a double sample on step j* (cf. Definition 5.1). Let: $$G^{(j^*)} := M[i, Y, Z]f^{(j^*)}, \qquad G^{(j)} := R[i, X_i^{(j)}]f^{(j)} \ \text{ for } j \ne j^*.$$ We prove that (in expectation over the double sample) the sum of the expectations Σ_j E[G^(j)(X^(j))] increases by a constant. To argue that the sum of expectations increases, the key part is to show that E[G^(j*)(X^(j*))] increases by a constant.
Lemma 5.12. Let X be a random vector, independent of this double sample and distributed according to a single-step distribution (Ω^n, 𝛑) such that π is the j*-th marginal distribution of P. Then, for every i ∈ [n] and every function f : Ω^n → [0, 1]: $$\mathrm{E}_{(Y,Z)}\big[\mathrm{E}[(M[i, Y, Z]f)(X)]\big] \ge \mathrm{E}[f(X)] + \frac{\tau \,(1 - \lambda_2(K))}{2},$$ where τ = Inf_i(f(X)) and K is the Markov kernel induced by the double sample.
Recall that the distribution of (Y, Z) depends on j*. We do not need to consider the full multi-step process in this lemma, but when applying it later we will set X = X^(j*). Proof. Fix a vector x_{\i} for X_{\i}, and define the function h : Ω → [0, 1] by h(a) := f(x_1, ..., x_{i−1}, a, x_{i+1}, ..., x_n). Using max{h(Y), h(Z)} = (h(Y) + h(Z))/2 + |h(Y) − h(Z)|/2 and |h(Y) − h(Z)| ≥ (h(Y) − h(Z))², we obtain a lower bound on the conditional increase in terms of Var[h(Y)] − Cov[h(Y), h(Z)], and hence, averaging over X_{\i}, in terms of Inf_i(f(X)). Since Y and Z are symmetric (i.e., they define a reversible Markov chain, cf. remarks after Definition 5.1) and by (28), the covariance term is at most λ_2(K) · Var[h(Y)], and the bound follows as claimed.
Lemma 5.13. Let a random vector X be distributed according to (Ω, P) and let f^(1), ..., f^(ℓ) : Ω^n → [0, 1]. Pick a double sample (X^{\j*}, Y, Z) from P and let: $$G^{(j^*)} := M[i, Y, Z]f^{(j^*)}, \qquad G^{(j)} := R[i, X_i^{(j)}]f^{(j)} \ \text{ for } j \ne j^* \quad (29).$$ Then: $$\mathrm{E}\Big[\sum_{j=1}^{\ell} \mathrm{E}[G^{(j)}(X^{(j)})]\Big] \ge \sum_{j=1}^{\ell} \mathrm{E}[f^{(j)}(X^{(j)})] + \frac{\tau (1 - \lambda_2(K))}{2}, \qquad \tau := \mathrm{Inf}_i(f^{(j^*)}(X^{(j^*)})).$$ Note that (29) defines the functions G^(j) as random variables, which is why we use capital letters.
Proof. For j ≠ j*, the expectation of E[G^(j)(X^(j))] over the double sample equals E[f^(j)(X^(j))], since the marginal distribution of X_i^(j) is exactly the marginal π of P. Hence, it suffices to show that the expectation of E[G^(j*)(X^(j*))] increases by at least τ(1 − λ_2(K))/2, but this is exactly Lemma 5.12.
Lemma 5.14. Let X be a random vector distributed according to (Ω, P), and let f^(1), ..., f^(ℓ) : Ω^n → [0, 1] with Inf_i(f^(j*)(X^(j*))) ≥ τ. Then there exist values x^{\j*} = (x^(1), ..., x^(ℓ), omitting step j*) and y, z such that the functions g^(j) defined as in Lemma 5.11 satisfy both (33), that the sum of the expectations Σ_j E[g^(j)(X^(j))] increases by a constant depending on τ and ρ(P), and (34), that the assumptions (22) and (23) of Lemma 5.11 hold with a constant β. While (33) is immediate from Lemma 5.13, we have to do a little bit of work to guarantee (34).
Proof. Choose (X^{\j*}, Y, Z) as a double sample from P and let G^(j) be defined as in (29).
We can now repeat the process from Lemma 5.14 multiple times to get the result of this section.
Corollary 5.15 below records the outcome: after the process, all influences of the resulting functions are at most τ (first point), the product expectation decreases by at most a constant factor (second point), and the expectations of the individual functions do not drop much below their original values (third point). Proof. We repeat the process from Lemma 5.14, always replacing the collection of functions f^(j) with the functions g^(j). Since Σ_j E[f^(j)(X^(j))] cannot exceed ℓ and every time it increases by τ(1 − ρ²)/2, we have to do this at most 2ℓ/(τ(1 − ρ²)) times. The first point is then obvious, and the second point follows from Lemma 5.14. Finally, the third point follows because the functions f^(j) are all ε-resilient up to size k, and each of the functions g^(j) can be written as a maximum of restrictions of size at most k of f^(j). Since the maximum only increases expectations, the proof follows.

Finishing the proof
Proof of Theorem 3.2. Let us assume that µ ∈ (0, 0.99], the computations being only easier if this is not the case. To establish (5): whenever we say "constant", in the O() notation or otherwise, we mean "depending only on P (in particular, on α, ρ, |Ω| and ℓ), but not on µ".

Proof for Two Steps
Our goal in this section is to prove Theorem 3.1 assuming Theorem 3.2. In the following we will sometimes drop the assumption that Ω is necessarily the support of the marginal distribution π. One can check that this will not cause problems.

Correlation of a cycle
Assume we are given a support set Ω of size |Ω| = k. Let s ≥ 2, p ∈ (0, 1) and let (x_0, ..., x_{s−1}) be a sequence of pairwise distinct elements x_i ∈ Ω.
Definition 6.1. We call a probability distribution C over Ω² an (s, p)-cycle if $$C(x_i, x_i) = \frac{p}{s} \quad \text{and} \quad C(x_i, x_{(i+1) \bmod s}) = \frac{1-p}{s} \quad \text{for } i \in \{0, \ldots, s-1\},$$ and C(x, y) = 0 otherwise.
Lemma 6.2. Let C be an (s, p)-cycle. Then $$\rho(C)^2 \le 1 - 2p(1-p)\left(1 - \cos\frac{2\pi}{s}\right) < 1.$$ Proof. Let K be the Markov kernel induced by a double sample on C (K is the same whether the sample is on the first or the second step, cf. Section 5.1). Observe that $$K(x_i, x_i) = 1 - 2p(1-p), \qquad K(x_i, x_{(i \pm 1) \bmod s}) = p(1-p).$$ Let α_k := 2πk/s. One can check that the eigenvalues of K are λ_0, ..., λ_{s−1} with λ_k := 1 − 2p(1 − p)(1 − cos α_k). This is easiest if one knows the respective (complex) eigenvectors v_k := (1, exp(α_k ı), ..., exp((s − 1)α_k ı)) (where ı is the imaginary unit). Using λ_2(K) = max_{k ≠ 0} λ_k = 1 − 2p(1 − p)(1 − cos(2π/s)), the bound on ρ(C) now follows from Lemma 5.3.
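The eigenvalue formula can be confirmed numerically (our own illustration, using the kernel entries derived in the proof):

```python
import numpy as np

s, p = 5, 0.3
K = np.zeros((s, s))
for y in range(s):
    K[y, y] = 1 - 2 * p * (1 - p)                        # stay put
    K[y, (y + 1) % s] = K[y, (y - 1) % s] = p * (1 - p)  # move one step either way

eig = np.sort(np.linalg.eigvalsh(K))[::-1]
alpha = 2 * np.pi * np.arange(s) / s
formula = np.sort(1 - 2 * p * (1 - p) * (1 - np.cos(alpha)))[::-1]
print(np.allclose(eig, formula))    # True
print("rho(C) =", np.sqrt(eig[1]))  # sqrt(lambda_2(K))
```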

Convex decomposition of P
In this section we show that if a distribution P can be decomposed into a convex combination of distributions P = Σ_{k=1}^r β_k P_k and each distribution P_k is same-set hitting, then P is also same-set hitting. Definition 6.3. We say that a probability distribution with equal marginals P has an (α, ρ)-convex decomposition if there exist β_1, ..., β_r > 0 with Σ_{k=1}^r β_k = 1 and distributions with equal marginals P_1, ..., P_r such that P = Σ_{k=1}^r β_k · P_k, with α(P_k) ≥ α and ρ(P_k) ≤ ρ for every k ∈ [r].
Lemma 6.4. Let a two-step probability distribution P have an (α, ρ)-convex decomposition with α > 0 and ρ < 1. Then, for every function f : Ω^n → [0, 1] with E[f(X)] = E[f(Y)] = µ > 0 we have E[f(X) f(Y)] ≥ c(α, ρ, µ) > 0. Proof. Let us write the relevant decomposition as P = Σ_{k=1}^r β_k P_k. The existence of this decomposition implies that there exists a random vector Z = (Z_1, ..., Z_n) such that: • The coordinates Z_i are i.i.d. with Pr[Z_i = k] = β_k. • For every i ∈ [n] and k ∈ [r], conditioned on Z_i = k, the tuple X_i is distributed according to P_k.

Decomposition of P into cycles
Definition 6.5. Let us consider weighted directed graphs with non-negative weights over a vertex set Ω. We will identify such a digraph G with its weight matrix.
We say that such a weighted digraph is regular, if for every vertex the total weight of the incoming edges is equal to the total weight of the outgoing edges.
We call a weighted digraph a weighted cycle if it is a directed cycle over a subset of Ω with all edges of the same weight w > 0. We call w the weight of the cycle and the number s of its edges the size of the cycle.
We say that a weighted digraph G can be decomposed into r weighted cycles if there exist weighted cycles C_1, ..., C_r such that G = Σ_{k=1}^r C_k.
Lemma 6.6. Every regular weighted digraph G over a set Ω of size k can be decomposed into at most k² weighted cycles.
Proof. Since the digraph is regular, it must contain a cycle. Remove it from the graph (taking as the weight w the minimum weight of an edge on this cycle).
Since the resulting graph is still regular, proceed by induction until the graph is empty. At each step at least one edge is completely removed from the graph, therefore there will be at most k² steps.
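The greedy procedure from this proof is straightforward to implement (our own illustrative sketch; the exact zero test is fine for the rational weights used here, while general floating-point weights would need a tolerance):

```python
def cycle_decomposition(G):
    """Decompose a regular weighted digraph into weighted cycles.
    G: dict mapping (u, v) -> weight > 0 with equal in- and out-weight at
    every vertex. Returns a list of (cycle_vertices, weight) pairs."""
    G = {e: w for e, w in G.items() if w > 0}
    cycles = []
    while G:
        start = next(iter(G))[0]
        path, seen = [start], {start: 0}
        while True:  # follow outgoing edges until a vertex repeats
            u = path[-1]
            v = next(v2 for (u2, v2) in G if u2 == u)  # exists by regularity
            if v in seen:
                cyc = path[seen[v]:]
                break
            seen[v] = len(path)
            path.append(v)
        edges = list(zip(cyc, cyc[1:] + cyc[:1]))
        w = min(G[e] for e in edges)
        for e in edges:  # remove the cycle; at least one edge disappears
            G[e] -= w
            if G[e] == 0:
                del G[e]
        cycles.append((cyc, w))
    return cycles

# Example: the basic distribution from Section 1.4 minus its diagonal part.
print(cycle_decomposition({(x, (x + 1) % 3): 1 / 6 for x in range(3)}))
```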
To see that a two-step distribution P can be decomposed into cycles, it will be useful to take P′ := P − α(P) · Id and view it as a weighted directed graph (Ω, P′), where P′ is interpreted as a weight function P′ : Ω × Ω → R_{≥0}.
Lemma 6.7. Let P be a two-step distribution with equal marginals over an alphabet Ω of size t.
Then, P has a convex decomposition P = Σ_{k=1}^r β_k P_k such that each P_k either has support of size 1 or is an (s, p)-cycle with 2 ≤ s ≤ t and p ∈ [α(P)³, 1/2].
Proof. Throughout this proof we will treat P as the weight matrix of a digraph. Since P has equal marginals, this weighted digraph is regular. Use Lemma 6.6 to decompose P − α(P) · Id into weighted cycles, which allows us to write $$P = \alpha(P) \cdot \mathrm{Id} + \sum_{k=1}^{r} C_k,$$ where C_k is a weighted cycle with weight w_k and size s_k, and r ≤ t². Take β_k := min(w_k, α(P)/t²) and let Id_k be the identity matrix restricted to the support of C_k. Now we can write P as $$P = \Big(\alpha(P) \cdot \mathrm{Id} - \sum_{k=1}^{r} \beta_k \mathrm{Id}_k\Big) + \sum_{k=1}^{r} \big(\beta_k \mathrm{Id}_k + C_k\big).$$ Firstly, (α(P) · Id − Σ_k β_k Id_k) can be decomposed into distributions with support of size 1.
As for the other term, note that $$\widetilde{C}_k := \frac{\beta_k \mathrm{Id}_k + C_k}{s_k (w_k + \beta_k)}$$ is a probability distribution that either has support of size 1 (iff C_k has support of size 1) or is an (s, p)-cycle with 2 ≤ s ≤ t and p = β_k/(β_k + w_k).

Putting things together
Proof of Theorem 3.1. Combine Lemmas 6.7 and 6.4, using Lemma 6.2 to bound the correlation of each cycle. Remark 6.8. One can see that, as in Theorem 3.2, we obtain a triply exponential explicit bound, i.e., there exists D(α(P)) > 0 such that if µ ∈ (0, 0.99], then one can take c(µ) = exp(−exp(exp(D · µ^{−D}))).

Local Variance
In this section we state and prove a generalization of the low-influence theorem from [Mos10]. We assume that the reader is familiar with Fourier coefficients f̂(σ) and the basics of discrete function analysis; for details see, e.g., Chapter 8 of [O'D14]. [Mos10] shows that ρ(P) < 1 implies that P is set hitting for low-influence functions. We extend this result to a weaker notion of influence. In particular, we show that P is set hitting for functions with Ω(1) measure and o(1) largest Fourier coefficient. The main result of this section is Theorem 3.3.
We remark that Theorem 3.3 does not require equal marginals. The rest of this section contains the proof of Theorem 3.3. First, from Corollary 5.15 and Theorem 4.1 it is easy to establish the following: Theorem 7.1. Let X be a random vector distributed according to an ℓ-step distribution P with ρ(P) ≤ ρ < 1 and let ε ∈ [0, 1). Then, there exists k := k(P, ε, µ^(1), ..., µ^(ℓ)) such that (6) holds for all functions f^(1), ..., f^(ℓ) : Ω^n → [0, 1] that are ε-resilient up to size k and satisfy E[f^(j)(X^(j))] ≥ µ^(j) > 0.

For S ⊆ [n], let f^{⊆S} := Σ_{σ : supp(σ) ⊆ S} f̂(σ) χ_σ denote the projection of f onto the characters supported inside S; note that for a restriction R fixing exactly the coordinates in S to the values x, we have E[(Rf)(X)] = f^{⊆S}(x). The next lemma shows that small "local variances" imply resilience:

Lemma 7.4. Let f : Ω^n → [0, 1] with E[f(X)] = µ, and suppose that Var[f^{⊆S}(X)] ≤ α^k (εµ)² for every S ⊆ [n] with |S| ≤ k, where α := min_x π(x). Then f is ε-resilient up to size k.

Proof. We prove the contraposition.
If f is not ε-resilient up to size k, by the definition of f^{⊆S} it implies that there exist S ⊆ [n] with |S| = k and x such that $$\big| f^{\subseteq S}(x) - \mathrm{E}[f(X)] \big| > \epsilon\, \mathrm{E}[f(X)].$$

But this gives $$\mathrm{Var}\big[f^{\subseteq S}(X)\big] \ge \pi^{\otimes S}(x) \cdot \big(f^{\subseteq S}(x) - \mathrm{E}[f^{\subseteq S}(X)]\big)^2 > \alpha^k (\epsilon \mu)^2,$$
as required.
Using Lemma 7.4 we can weaken the assumption in Theorem 7.1 so that it only requires that all Fourier coefficients of degree at most k are small. Proof of Theorem 3.3. From Theorem 7.1, there exists k := k(P, µ^(1), ..., µ^(ℓ)) such that if f^(1), ..., f^(ℓ) are all 1/2-resilient up to size k, then (6) holds. Therefore, it is sufficient to show that the functions f^(j) are indeed 1/2-resilient up to size k if the parameter γ is chosen small enough. By Claim 7.3, if max_{σ : 0 < |σ| ≤ k} |f̂^(j)(σ)| ≤ γ, then for any S ⊆ [n] with |S| = k we have Var[(f^(j))^{⊆S}(X^(j))] ≤ 2^k γ². With that in mind it is easy to choose γ such that Lemma 7.4 can be applied to each f^(j).

Multiple Steps of a Markov Chain
Next, we consider the case where the distribution P is such that the random variables X^(1), X^(2), ..., X^(ℓ) form a Markov chain.
Definition 8.1. Let P be an ℓ-step distribution with equal marginals and let X = (X^(1), ..., X^(ℓ)) be a random variable distributed according to P. We say that P is generated by Markov chains if for every j ∈ {2, ..., ℓ} and x^(1), ..., x^(j) ∈ Ω we have $$\Pr\big[X^{(j)} = x^{(j)} \mid X^{(1)} = x^{(1)}, \ldots, X^{(j-1)} = x^{(j-1)}\big] = \Pr\big[X^{(j)} = x^{(j)} \mid X^{(j-1)} = x^{(j-1)}\big].$$ Observe that since we still require P to have equal marginals, the marginal π is then simply a stationary distribution of the chain.
In this case, we give a reduction to Theorem 3.1 to prove a bound that does not depend on ρ(P):

Theorem 8.2. Let Ω be a finite set and P a probability distribution over Ω^ℓ with equal marginals generated by Markov chains. Let the tuples X_i be i.i.d. according to P for i ∈ [n].
Then, for every f : Ω^n → [0, 1] with E[f(X^(j))] ≥ µ > 0 for all j ∈ [ℓ]: $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f(X^{(j)})\Big] \ge c(P, \mu),$$ where the function c() is positive whenever α(P) > 0.
Proof. Let P be a distribution generated by Markov chains with α := α(P) > 0, and let f : Ω^n → [0, 1] satisfy E[f(X^(j))] ≥ µ for all j ∈ [ℓ]. The proof is by induction on ℓ. For ℓ = 2, apply Theorem 3.1 directly. For ℓ > 2, define the function g : Ω^n → [0, 1] as $$g(x) := f(x) \cdot \mathrm{E}\big[f(X^{(\ell)}) \mid X^{(\ell-1)} = x\big].$$ Applying Theorem 3.1 to the distribution of the last two steps, $$\mathrm{E}[g(X^{(\ell-1)})] = \mathrm{E}\big[f(X^{(\ell-1)}) f(X^{(\ell)})\big] \ge c_2(\mu) > 0. \quad (43)$$ Now we have $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f(X^{(j)})\Big] = \mathrm{E}\Big[\prod_{j=1}^{\ell-2} f(X^{(j)}) \cdot g(X^{(\ell-1)})\Big] \quad (44)$$ $$\ge \mathrm{E}\Big[\prod_{j=1}^{\ell-1} g(X^{(j)})\Big] \quad (45)$$ $$\ge c_{\ell-1}\big(c_2(\mu)\big) > 0, \quad (46)$$ where (44) holds since P is generated by Markov chains, (45) is due to f ≥ g pointwise, and (46) is an application of the induction hypothesis and (43).
Remark 8.3. Unfortunately, this proof worsens the explicit bound. One can check that for a Markov-generated distribution with ℓ steps the dependence on µ is a tower of exponentials of height 3(ℓ − 1).

Polynomial Same-Set Hitting
The property of set hitting establishes a lower bound on E[∏_{j=1}^ℓ f^(j)(X^(j))] that is independent of n. However, it might be the case that this bound is very small, perhaps far from the best possible one. In particular, our bound from Theorem 3.2 is triply exponentially small, and the bound from Theorem 1.2 is not even primitive recursive.
Definition 9.1. A distribution P is polynomially set hitting (resp. polynomially same-set hitting) if there exists C ≥ 0 such that P is (µ, µ C )-set hitting (resp. same-set hitting) for every µ ∈ (0, 1]. As a matter of fact, [MOS13] (cf. Theorem 1.4) establishes that all distributions that are set hitting are also polynomially set hitting. We suspect that this is also the case for two-step same-set hitting, but this remains an open problem.
However, it is possible to harness reverse hypercontractivity to show that all symmetric two-step distributions are polynomially same-set hitting: Theorem 9.2. Let a two-step probability distribution with equal marginals P be symmetric, i.e., P(x, y) = P(y, x) for all x, y ∈ Ω. If α(P) > 0, then P is polynomially same-set hitting.
We omit the proof of Theorem 9.2, noting that the idea is similar to that of Section 6: one performs an obvious convex decomposition of P into cycles of length two and applies the result of [MOS13] to each term of this decomposition.

A Proof of Theorem 4.1
Our proof of Theorem 4.1 follows in this appendix. It is only a slight adaptation of the argument from [Mos10], but we include it in full for the sake of completeness. We first restate the theorem and discuss the differences between our proof and the one in [Mos10]: Theorem 4.1. Let X be a random vector distributed according to (Ω, P) such that P has equal marginals, ρ(P) ≤ ρ < 1 and min_{x∈Ω} π(x) ≥ α > 0. Then, for all ε > 0, there exists τ := τ(ε, ρ, α, ℓ) > 0 such that if the functions f^(1), ..., f^(ℓ) : Ω^n → [0, 1] satisfy Inf_i(f^(j)(X^(j))) ≤ τ for all i ∈ [n] and j ∈ [ℓ], then $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f^{(j)}(X^{(j)})\Big] \ge \Big(\prod_{j=1}^{\ell} \mathrm{E}[f^{(j)}(X^{(j)})]\Big)^{\ell/(1-\rho^2)} - \epsilon.$$ Furthermore, there exists an absolute constant C ≥ 0 such that for ε ∈ (0, 1/2] one can take an explicit τ depending polynomially on ε.

Theorem 4.1 is very similar to a subcase of Theorem 1.14 from [Mos10]. We make a stronger claim in one respect: in [Mos10] the influence threshold τ depends, among others, on the smallest atom of the joint distribution, $$\alpha^* := \min_{x \in \Omega^\ell : P(x) > 0} P(x), \quad (47)$$ while our bound depends only on the smallest marginal probability: $$\alpha := \min_{x \in \Omega} \pi(x). \quad (48)$$ The main differences to the proof in [Mos10] are:

• [Mos10] proves the base case ℓ = 2 and then obtains the result for general ℓ by an inductive argument (cf. Theorem 6.3 and Proposition 6.4 in [Mos10]). Since the induction is applied to the functions f^(1) and g := ∏_{j=2}^ℓ f^(j), where g is viewed as a function on a single-step space, the information on the smallest marginal is lost in the case of g. To avoid this, our proof proceeds directly for general ℓ. However, the structure and the main ideas are really the same as in [Mos10].

• In the hypercontractivity bounds for Gaussian and discrete spaces (Theorem A.42 and Lemma A.43) we are slightly more careful, to obtain bounds which depend on α rather than α* (as defined in (48) and (47)). This better bound is then propagated in the proof of the invariance principle.
The proof can be generalized in several directions, but for the sake of clarity we present the simplest version sufficient for our purposes.

A.1 Preliminaries -the general framework
We start with explaining the notation of random variables and L 2 spaces that we will use throughout the proof.
Definition A.1. Let (Ω, F, P) be a probability space. We define the real inner product space L²(Ω, P) as the set of all square-integrable functions f : Ω → R, i.e., the functions that satisfy $$\int_{\Omega} f^2 \, dP < \infty, \quad (49)$$ with the inner product defined as $$\langle f, g \rangle := \int_{\Omega} f g \, dP. \quad (50)$$ Remark A.2. As we will see shortly, if X is a random variable sampled from Ω according to P, the equations (49) and (50) can be written as E[f(X)²] < ∞ and ⟨f, g⟩ = E[f(X) g(X)]. Remark A.3. We omitted the event space F in the definition of L²(Ω, P). This is because F is always implicit in the choice of the measure P.
In particular, when P is discrete, of course we choose F to be the powerset of Ω. When P is continuous over R^n, we use the "standard" real event space, i.e., the completion of the Borel algebra.
While this will not be our usual way of thinking, at this point it makes sense to introduce the formal definition of a random variable: a function from a probability space to some set.
Definition A.4. Let (Σ, F, P) be a probability space. We say that X is a random variable over a set Σ ′ if it is a measurable function X : Σ → Σ ′ .
As usual, we will assume throughout the proof that all random variables are induced by some underlying probability space (Σ, F, P).
Using this, a random variable induces some distribution, which we can study.
Definition A.5. We say that a random variable X over a set Ω is distributed according to a probability space (Ω, P) if for every event A ∈ F: Pr[X ∈ A] = P(A). Definition A.6. Let X be a random variable distributed over Ω. By L²(X) we denote the inner product space of random variables that correspond to square-integrable functions f : Ω → R: $$L^2(X) := \{ f(X) : f \in L^2(\Omega, P) \},$$ with the inner product given as $$\langle f(X), g(X) \rangle := \mathrm{E}[f(X) g(X)].$$ Remark A.7. We consider the formal setting again, i.e., suppose (Σ, F, P) is the underlying probability space, and X : Σ → Ω a random variable. Then, L²(X) is a subspace of L²(Σ, P). Intuitively, it contains all real-valued functions which "depend only on X".
Example A.8. Fix (Ω, P) to be the uniform distribution on Ω := {0, 1, 2} and let X be distributed according to (Ω, P). Then L²(X) has dimension three, and one of its orthonormal bases is $$\Big\{\, 1, \;\; \sqrt{3/2}\,\big(\mathbb{1}_{X=0} - \mathbb{1}_{X=1}\big), \;\; \sqrt{1/2}\,\big(\mathbb{1}_{X=0} + \mathbb{1}_{X=1} - 2 \cdot \mathbb{1}_{X=2}\big) \,\Big\}.$$ After this point, we will have no need to refer explicitly to the underlying probability space (Σ, F, P) anymore. Nevertheless, it will be useful to remember that random variables are functions of this underlying space.
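Orthonormality of this basis is a one-line check (our own illustration):

```python
import numpy as np

# The three basis functions tabulated on Omega = {0, 1, 2}.
chi = np.array([
    [1, 1, 1],                                            # constant function 1
    [np.sqrt(3 / 2), -np.sqrt(3 / 2), 0],
    [np.sqrt(1 / 2), np.sqrt(1 / 2), -2 * np.sqrt(1 / 2)],
])
gram = chi @ chi.T / 3  # Gram matrix E[chi_a(X) chi_b(X)], uniform measure
print(np.allclose(gram, np.eye(3)))  # True
```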
It immediately follows from the definitions that: Lemma A.9. Let X be a random variable distributed according to (Ω, P). Then L 2 (X) is isomorphic to L 2 (Ω, P).

A.2 Preliminaries -orthonormal ensembles and multilinear polynomials
In this section we introduce orthonormal ensembles and multilinear polynomials over them.
Definition A.10. We call a finite family (X_0, ..., X_p) of random variables orthonormal if they satisfy E[X_k²] = 1 for every k and E[X_j X_k] = 0 for every j ≠ k. Definition A.11. We call a finite family of orthonormal random variables X_⋆ = (X_{⋆,0} = 1, X_{⋆,1}, ..., X_{⋆,p}) an orthonormal ensemble. We call p the size of the ensemble.
An ensemble sequence is a sequence of independent families of random variables X = (X_1, ..., X_n) such that each X_i is an orthonormal ensemble X_i = (X_{i,0} = 1, X_{i,1}, ..., X_{i,p}) of the same size p. We call n the size of the sequence.
The notation X_{⋆,k} is a little awkward, but we do not need to use it often. The reason for it is that we want to make sure that one cannot confuse one of the random variables X_{⋆,k} within an orthonormal ensemble with the orthonormal ensemble X_i itself. Whenever a random variable X_{i,k} is part of an ensemble X_i, there is no reason to use the ⋆-symbol. Instead we use the index of the ensemble.
Definition A.12. We call two ensemble sequences X = (X 1 , . . . , X n ) and Y = (Y 1 , . . . , Y m ) compatible if n = m and the sizes of the individual ensembles X i and Y i are the same.
Definition A.13. Let X = (X 1 , . . . , X n ) be an ensemble sequence such that each ensemble X i is of size p.
For a tuple σ we define its support as supp(σ) := {i ∈ [n] : σ i = 0} and its degree as the size of its support: |σ| := | supp(σ)|. Also, we will write the tuple (0, . . . , 0) as 0 n . Let a multilinear polynomial P compatible with X be given. Then, P (X ) is what one expects: the random variable obtained by evaluating the polynomial on the given input. Analogously, if σ is a tuple as above we write X σ for the random variable corresponding to the evaluation of the monomial x σ . Lemma A.14. Let X be an ensemble sequence and σ, τ two tuples whose monomials x σ , x τ are compatible with X . Then, and now we can use the orthonomality of each ensemble X i . For the second part, we apply the first on τ = 0 n . Definition A.15. Given a multilinear polynomial P (x) = σ α(σ)x σ we define its following properties: The next lemma states that the formal expressions defined above are consistent with the corresponding probabilistic interpretations for every ensemble sequence.

Lemma A.16. Let P be a multilinear polynomial compatible with an ensemble sequence X. Then E[P(X)] = E[P], Var[P(X)] = Var[P] and Inf_i(P(X)) = Inf_i[P] for all i. Proof. Expand P(X) in the monomials X^σ; by Lemma A.14 these are orthonormal, so E[P(X)] = α(0^n) and E[P(X)²] = Σ_σ α(σ)². Together this gives all three identities, as claimed.
For S ⊆ [n], let P_S := Σ_{σ : supp(σ) = S} α(σ) x^σ. Then, let P^{>d} := Σ_{S : |S| > d} P_S be P restricted to the tuples with degree greater than d. We also define P^{=d}, P^{≤d} etc. in the analogous way.
Lemma A.18. Let P and Q be multilinear polynomials compatible with an ensemble sequence X. Then, $$\mathrm{E}[P(\mathcal{X}) Q(\mathcal{X})] = \sum_{S \subseteq [n]} \mathrm{E}[P_S(\mathcal{X}) Q_S(\mathcal{X})].$$ Proof. It is enough to show that for S ≠ T we have E[P_S(X) Q_T(X)] = 0.
Definition A.23. We call an orthonormal ensemble G ⋆ of size p Gaussian if random variables G ⋆,1 , . . . , G ⋆,p are independent N (0, 1) Gaussians. We say that an ensemble sequence G = (G 1 , . . . , G n ) is Gaussian if for each i ∈ [n] the ensemble G i is Gaussian.
We remark that, as in all ensemble sequences, in a Gaussian ensemble sequence we have G_{i,0} ≡ 1 for all i.

A.3 Preliminaries -ensemble collections
In this section we recall the setting of Theorem 4.1 and introduce some other concepts we will need throughout the proof.
From now on we will always implicitly assume that all multi-step distributions P have equal marginals (denoted as π). This assumption is not necessary, but sufficient for our main purpose, while making the notation easier.
Definition A.25. Let X be a random variable distributed according to a single-step, single-coordinate distribution (Ω, π). We say that an orthonormal ensemble X ⋆ is constructed from X if the elements of X ⋆ form an orthonormal basis of L 2 (X).
Similarly, let X be a random vector distributed according to (Ω^n, 𝛑). We say that an ensemble sequence X = (X_1, ..., X_n) is constructed from X if, for every i ∈ [n], the ensemble X_i is constructed from the random variable X_i. The definition of ensemble sequences requires that X_{i,0} ≡ 1 for every i; of course we can find a basis of L²(X_i) which satisfies this requirement, so that ensemble sequences constructed from X indeed exist.
Lemma A.26. Let X be an ensemble sequence constructed from a random vector X distributed according to (Ω^n, 𝛑). Assume that the size of each ensemble X_i is p. Then the set of monomials B := {X^σ : σ ∈ {0, 1, ..., p}^n} is an orthonormal basis of L²(X).
Proof. Observe that the dimension of L²(X_i) is p + 1 (note that p + 1 is the support size of the single-coordinate distribution (Ω, π)). Hence, the dimension of L²(X) is (p + 1)^n, which equals the size of B. Therefore, it is enough to check that B is orthonormal, which is done in Lemma A.14.
Definition A.27. Let X be an ensemble sequence constructed from a random vector X distributed according to (Ω, π).
For a function f : Ω → R and a multilinear polynomial P compatible with X we say that f (X) is equivalent to P if it always holds that f (X) = P (X ) .
Recall the operator T ρ from Definition A.22. We show that it has a natural counterpart in L 2 (Ω, π).
Definition A.28. Let ρ ∈ [0, 1]. We define a linear operator T_ρ : L²(Ω^n, 𝛑) → L²(Ω^n, 𝛑) as $$(T_\rho f)(x) := \mathrm{E}[f(Y_{\rho,x})],$$ where Y_{ρ,x} is a random vector with independent coordinates distributed such that Y_{ρ,x,i} = x_i with probability ρ, and Y_{ρ,x,i} is (independently) distributed according to (Ω, π) with probability 1 − ρ.
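For small n the operator can be evaluated by exact enumeration (our own illustrative sketch, for the uniform marginal):

```python
import itertools

def T_rho(f, rho, n, q=3):
    """(T_rho f)(x) = E[f(Y_{rho,x})] on {0,...,q-1}^n with uniform pi: each
    coordinate of Y independently equals x_i w.p. rho and is resampled
    uniformly w.p. 1 - rho."""
    def Tf(x):
        total = 0.0
        for y in itertools.product(range(q), repeat=n):
            prob = 1.0
            for xi, yi in zip(x, y):
                prob *= rho * (xi == yi) + (1 - rho) / q
            total += prob * f(y)
        return total
    return Tf

f = lambda x: 1.0 if x[0] == 0 else 0.0  # dictator indicator
print(T_rho(f, rho=0.5, n=2)((0, 0)))    # rho + (1 - rho)/3 = 0.6666...
```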
The next lemma states that taking operator T ρ preserves the equivalence of functions and polynomials: Lemma A.29. Let X be an ensemble sequence constructed from a random vector X distributed according to (Ω, π).
Let ρ ∈ [0, 1], f : Ω^n → R and let P be a multilinear polynomial equivalent to f. Then, T_ρ P and T_ρ f are equivalent, i.e., (T_ρ f)(X) = (T_ρ P)(X). Proof. Fix x ∈ Ω^n and consider the random vector Y_{ρ,x} from Definition A.28: each of its coordinates is the constant x_i with probability ρ, and a random coordinate distributed as X_i with probability 1 − ρ.
Note that Y ρ,x is not an ensemble sequence, but this will not cause problems.
Writing P(x) = Σ_σ α(σ) · x^σ and using that each non-constant basis element has mean zero, we can calculate that E[f(Y_{ρ,x})] equals the evaluation of Σ_σ ρ^{|σ|} α(σ) x^σ = (T_ρ P)(x). Since x was arbitrary, the claim is proved.
Recall Definition A.23. In the proof we will construct a tuple of ensemble sequences X = (X^(1), ..., X^(ℓ)) from a random vector X and consider relations between those sequences and compatible Gaussian ensemble sequences. To this end, we need to introduce the Gaussian equivalent of the marginal ensemble sequences X^(j). Lemma A.31. Let a random tuple X = (X^(1), ..., X^(ℓ)) be distributed according to a single-coordinate distribution (Ω, P). Let X_⋆ = (X^(1)_⋆, ..., X^(ℓ)_⋆) be orthonormal ensembles constructed from X^(1), ..., X^(ℓ), respectively. Then, there exist Gaussian orthonormal ensembles G_⋆ = (G^(1)_⋆, ..., G^(ℓ)_⋆) such that for all j_1, j_2 ∈ [ℓ] and all k_1, k_2: $$\mathrm{E}\big[G^{(j_1)}_{\star,k_1} G^{(j_2)}_{\star,k_2}\big] = \mathrm{E}\big[X^{(j_1)}_{\star,k_1} X^{(j_2)}_{\star,k_2}\big]. \quad (63)$$ Proof. Consider (Ω, P) as a single-step probability space, and let X be the corresponding random variable. Let now Z_⋆ be an orthonormal ensemble constructed from X. Recall that this means that the elements of Z_⋆ form an orthonormal basis of L²(X).
Let H ⋆ be a Gaussian ensemble sequence compatible with Z ⋆ . Define the map Ψ : L 2 (X) → V (H ⋆ ) by linearly extending Ψ(Z ⋆,k ) := H ⋆,k . In this way Ψ becomes an isomorphism between L 2 (X) and V (H ⋆ ) (and as such it preserves inner products).
Since L²(X^(j)) is a subspace of L²(X), we can define G^(j)_{⋆,k} := Ψ(X^(j)_{⋆,k}). Since Ψ preserves inner products we get (63).
We still need to argue that for each j ∈ [ℓ] the ensemble G^(j)_⋆ is a Gaussian orthonormal ensemble. Its orthonormality follows from (63) for j_1 = j_2 = j (note that Ψ(1) = 1).
The variables G (j) ⋆,k are clearly jointly Gaussian, since they can be written as sums of independent Gaussians. By (63), their covariance matrix is identity. This finishes the proof, since joint Gaussians with the identity covariance matrix must be independent.
Since the proof of Lemma A.31 is somewhat abstract, we illustrate the construction of G ⋆ with an example.

A.4 Hypercontractivity
In this section we develop a version of hypercontractivity for products of multilinear polynomials. Our goal is to prove Lemma A.43. Recall the operator T ρ from Definition A.22.
Definition A.35. Let X be an ensemble sequence and let 1 ≤ p ≤ q < ∞ and ρ ∈ [0, 1]. We say that the sequence X is (p, q, ρ)-hypercontractive if for every multilinear polynomial P compatible with X we have $$\big\| (T_\rho P)(\mathcal{X}) \big\|_q \le \big\| P(\mathcal{X}) \big\|_p.$$ Definition A.36. Let X_⋆ be an orthonormal ensemble and let 1 ≤ p ≤ q < ∞ and ρ ∈ [0, 1]. We say that the ensemble X_⋆ is (p, q, ρ)-hypercontractive if the one-element ensemble sequence X := (X_⋆) is (p, q, ρ)-hypercontractive.
We start with stating, without proofs, the hypercontractivity results for orthonormal ensembles that we use in the invariance principle. Theorem A.37. Let X_⋆ be an orthonormal ensemble constructed from a random variable X distributed according to a (single-coordinate, single-step) probability space (Ω, π) with min_{x∈Ω} π(x) ≥ α ≥ 0. Then X_⋆ is (2, 3, α^{1/6}/2)-hypercontractive.
Yet again, we omit the proof of Theorem A.39. We remark that it is well-known as the tensorization argument. The argument can be found, e.g., in the proof of Proposition 3.11 in [MOO10].
Definition A.40. Let X be a random vector distributed according to a (single-step, tensorized) probability space (Ω, π). We say that an ensemble sequence X = (X 1 , . . . , X n ) is X-Gaussian-mixed if for each i ∈ [n]: • Either X i is constructed from the random variable X i , • or X i is a Gaussian ensemble.
Theorems A.37, A.38 and A.39 immediately imply: Corollary A.41. Let X be a random vector distributed according to a probability space (Ω^n, 𝛑) with min_{x∈Ω} π(x) ≥ α ≥ 0, and let X be an X-Gaussian-mixed ensemble sequence. Then X is (2, 3, α^{1/6}/2)-hypercontractive.
Theorem A.42. Let X be a random vector distributed according to a probability space (Ω^n, 𝛑) with min_{x∈Ω} π(x) ≥ α > 0 and let X be an X-Gaussian-mixed ensemble sequence. Let P be a multilinear polynomial compatible with X of degree at most d. Then, $$\mathrm{E}\big[|P(\mathcal{X})|^3\big] \le \big(2\alpha^{-1/6}\big)^{3d}\, \mathrm{E}\big[P(\mathcal{X})^2\big]^{3/2}.$$ Proof. Let ρ := α^{1/6}/2 and write P(X) = Σ_σ β(σ) X^σ. The claim follows from Corollary A.41, the definitions of T_ρ and E[P²], and the degree bound on P. Lemma A.43. Let X be a random vector distributed according to a (multi-step) probability space with equal marginals (Ω, P) with min_{x∈Ω} π(x) ≥ α > 0. Let S^(1), ..., S^(ℓ) be ensemble sequences such that S^(j) is X^(j)-Gaussian-mixed. Let P^(1), ..., P^(ℓ) be multilinear polynomials such that P^(j) is compatible with S^(j) and also deg(P^(j)) ≤ d.

A.5 Invariance principle
In this section we prove a basic version of the invariance principle for multiple polynomials. We say that a function is B-smooth if all of its third-order partial derivatives are uniformly bounded by B: Definition A.44. For B ≥ 0 we say that a function Ψ : R^ℓ → R is B-smooth if Ψ ∈ C³ and for every j_1, j_2, j_3 ∈ [ℓ] and every x = (x^(1), ..., x^(ℓ)) ∈ R^ℓ we have $$\left| \frac{\partial^3 \Psi}{\partial x^{(j_1)} \partial x^{(j_2)} \partial x^{(j_3)}}(x) \right| \le B.$$ Theorem A.45 (Invariance Principle). Let (X, X, G) be an ensemble collection for a probability space (Ω, P) with min_{x∈Ω} π(x) ≥ α > 0.
Due to Claim A.47, we will estimate E[Ψ(P(U^(i−1))) − Ψ(P(U^(i)))] for every i ∈ [n], where U^(i) denotes the hybrid ensemble sequence which is Gaussian in the first i coordinates and agrees with X in the remaining ones. For each j we decompose $$P^{(j)} = A^{(j)} + B^{(j)}, \quad (65)$$ where A^(j) collects the monomials that do not contain coordinate i and B^(j) the remaining ones. We note for later use that the construction gives us that A^(j) does not depend on the i-th ensembles at all, while every monomial of B^(j) does. The rest of the proof proceeds as follows: we calculate the multivariate second-order Taylor expansion (i.e., with a third-degree remainder) of the expression around the point A := (A^(1), ..., A^(ℓ)). We will see that: • All the terms up to the second degree cancel in expectation due to the properties of ensemble sequences.
• The remainder, which is of the third degree, can be bounded using that Ψ is Bsmooth, properties of P (j) i , and hypercontractivity, in particular Lemma A.43.
where the random variables R_T and R_U are the third-order Taylor remainders, whose expectations are bounded in (70). Proof. We show only (68) and the bound on E[|R_T|], the proofs for the ensemble sequence U being analogous. As a preliminary remark, note that since all the random ensembles we are dealing with are hypercontractive, and since Ψ is B-smooth, all the terms in the expressions above have finite expectations.
Keeping in mind both decompositions from (65), by Theorem A.48 we can expand Ψ(P(T^(i))) around the point A up to the third order. Since E[X^(j)_{i,k}] = 0, and all other terms are independent of coordinate i, the first-order terms vanish in expectation, which together with (71) yields (68).
Proof. First, we need to show that the second-order terms in (68) and (69) cancel out. Since by Lemma A.31 for every j_1, j_2 ∈ [ℓ] and k_1, k_2 > 0: $$\mathrm{E}\big[X^{(j_1)}_{i,k_1} X^{(j_2)}_{i,k_2}\big] = \mathrm{E}\big[G^{(j_1)}_{i,k_1} G^{(j_2)}_{i,k_2}\big],$$ and since all the other terms are independent of coordinate i, the second-order terms have equal expectations. Therefore, by (68), (69) and (70), the difference is bounded by the remainder terms, as claimed.

A.6 A tailored application of invariance principle
Definition A.52. Let P be a multilinear polynomial and γ ∈ [0, 1]. We say that P is γ-decaying if for each d ∈ N we have $$\mathrm{Var}\big[P^{>d}\big] \le (1-\gamma)^{2d}.$$ We also say that a tuple of multilinear polynomials P = (P^(1), ..., P^(ℓ)) is γ-decaying if each P^(j) is. Note that if a multilinear polynomial P is γ-decaying, then, in particular, Var[P] ≤ 1. Our goal in this section is to prove a version of the invariance principle for γ-decaying multilinear polynomials and the function χ: Theorem A.53. Let (X, X, G) be an ensemble collection for a probability space (Ω, P) with min_{x∈Ω} π(x) ≥ α, α ∈ (0, 1/2]. Let P = (P^(1), ..., P^(ℓ)) be such that P^(j) is a multilinear polynomial compatible with the ensemble sequence X^(j).

There are two problems with applying Theorem A.45 directly:
1. The function χ is not C³.
2. A γ-decaying multilinear polynomial does not have bounded degree.
We will deal with those problems in turn.

A.6.1 Approximating χ with a C 3 function
To apply Theorem A.45, we are going to approximate φ and χ with C 3 (in fact, C ∞ ) functions.
For that we need to introduce the notion of convolution and a basic calculus theorem, whose proof we omit (see, e.g., Chapter 9 in [Rud87]). We say that f has compact support if there exists a bounded interval I that is a support of f. Definition A.55. The convolution f * g of two continuous functions f, g : R → R, at least one of which has compact support, is $$(f * g)(x) := \int_{-\infty}^{\infty} f(y)\, g(x - y)\, dy.$$ Theorem A.56. Let the functions f, g : R → R be such that f is continuous on R, g ∈ C^∞ and g has compact support. Then, (f * g) ∈ C^∞. Furthermore, for every k ∈ N and x ∈ R: $$(f * g)^{(k)}(x) = (f * g^{(k)})(x).$$ We also need a special density function with support [−1, 1]: Theorem A.57. There exists a function ψ : R → R_{≥0} such that all of the following hold: • ψ has support [−1, 1]. • ψ ∈ C^∞. • ∫_{−1}^{1} ψ(x) dx = 1.
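One standard choice of such a ψ (any function with these three properties works; this particular normalization is the usual bump function, supplied here for concreteness) is:

$$\psi(x) := \begin{cases} c \cdot \exp\left(-\dfrac{1}{1 - x^2}\right) & |x| < 1, \\[4pt] 0 & |x| \ge 1, \end{cases} \qquad c := \left( \int_{-1}^{1} \exp\left(-\frac{1}{1 - t^2}\right) dt \right)^{-1}.$$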
Now we are ready for the approximation of χ: Definition A.63. Let λ ∈ (0, 1/2). Define the function χ_λ : R^ℓ → R by mollifying χ at scale λ. From Lemma A.62 we easily get: Corollary A.64. Let λ ∈ (0, 1/2). The function χ_λ has the following properties: 1) χ_λ agrees with χ outside a λ-neighborhood of the discontinuities of χ. 2) There exists a universal constant B ≥ 0 such that χ_λ is B/λ³-smooth. After developing the approximation we are ready to prove the invariance principle for the function χ: Theorem A.65. Let (X, X, G) be an ensemble collection for a probability space (Ω, P) with min_{x∈Ω} π(x) ≥ α > 0.
This finishes the proof of Theorem A.53.

A.7 Reduction to the γ-decaying case
To apply Theorem A.53 we need to show that "smoothing out" of multilinear polynomials P (1) , . . . , P (ℓ) does not change the expectation of their product too much.
Recall Definitions A.22 and A.28 for the operator T_ρ. Our goal in this section is to prove: Theorem A.69. Let X be a random vector distributed according to (Ω, P) with ρ(P) ≤ ρ ≤ 1. Let Z be an ensemble sequence constructed from X, and let X^(1), ..., X^(ℓ) be ensemble sequences constructed from X^(1), ..., X^(ℓ), respectively.
To give a formal argument, we use yet another ensemble sequence: let j ∈ [ℓ]. We define Y^(j) to be an ensemble sequence constructed from X^{([ℓ]\{j})}. Note that since A^(j) ∈ L²(X^{([ℓ]\{j})}), there exists a multilinear polynomial Q^(j) compatible with Y^(j) such that Q^(j)(Y^(j)) = A^(j). Proof. By definition of Q^(j).

A.8 Gaussian reverse hypercontractivity
Definition A.75. Let L²(R^n, γ_n) be the inner product space of square-integrable functions on R^n with the standard Gaussian measure γ_n (the product of n standard N(0, 1) measures).
Our goal in this section is to prove the following bound: Theorem A.76. Let (X, X , G) be an ensemble collection for a probability space (Ω, P) with ρ(P) ≤ ρ < 1 and such that each orthonormal ensemble in G has size p.
In order to prove Theorem A.76, we will use a multidimensional version of Gaussian reverse hypercontractivity stated as Theorem 1 in [CDP15] (cf. also Corollary 4 in [Led14]). To reduce Theorem A.76 to Theorem A.78 we first look at a single-coordinate variance bound for ensembles from X . Next, we will extend this bound to multiple coordinates and ensembles from G. Lemma A.80. Let (X, X , G) be an ensemble collection for a probability space (Ω, P) with ρ(P) ≤ ρ < 1 and such that each orthonormal ensemble in X has size p.
Fix i ∈ [n] and, for ease of notation, write X^(j) for the i-th coordinate ensembles in this proof. Proof. Since the ensembles X_i are independent, by Lemma A.80 the single-coordinate bound extends to all coordinates. Lemma A.82. Let (X, X, G) be an ensemble collection for a probability space (Ω, P) with ρ(P) ≤ ρ < 1.

A.9 The main theorem
We recall the low-influence theorem that we want to prove: Theorem 4.1. Let X be a random vector distributed according to (Ω, P) such that P has equal marginals, ρ(P) ≤ ρ < 1 and min_{x∈Ω} π(x) ≥ α > 0. Then, for all ε > 0, there exists τ := τ(ε, ρ, α, ℓ) > 0 such that if the functions f^(1), ..., f^(ℓ) : Ω^n → [0, 1] satisfy Inf_i(f^(j)(X^(j))) ≤ τ for all i ∈ [n] and j ∈ [ℓ], then $$\mathrm{E}\Big[\prod_{j=1}^{\ell} f^{(j)}(X^{(j)})\Big] \ge \Big(\prod_{j=1}^{\ell} \mathrm{E}[f^{(j)}(X^{(j)})]\Big)^{\ell/(1-\rho^2)} - \epsilon.$$
We need to define some new objects in order to proceed with the proof. Let (X, X , G) be an ensemble collection for (Ω, P).
For j ∈ [ℓ], let P (j) be a multilinear polynomial compatible with X (j) and equivalent to f (j) (X (j) ). For some small γ > 0 to be fixed later let Q (j) := T 1−γ P (j) . Finally, letting p be the size of each of the ensembles X Note that it might be impossible to write R (j) as a multilinear polynomial, but it will not cause problems in the proof. Finally, let µ ′(j) := E R (j) (G (j) ) . The proof proceeds by decomposing the expression we are bounding into several parts: We use the theorems proved so far to bound each of the terms (85), (86) and (87) in turn. First, we apply Theorem A.69 to show that (85) has small absolute value. Then, we use the invariance principle (Theorem A.53) to argue that (86) has small absolute value. Finally, using Gaussian reverse hypercontractivity (Theorem A.76) we show that (87) is bounded from below by (roughly) ℓ j=1 µ (j) ℓ/(1−ρ 2 ) . We proceed with a detailed argument in the following lemmas. In the following assume w.l.o.g that ǫ ≤ 1/2 and α ≤ 1/2. Proof. Note that for every j ∈ [ℓ] the polynomial Q (j) is γ-decaying and that it has bounded influence for every i ∈ [n]: Inf i (Q (j) ) ≤ Inf i (P (j) ) = Inf i (f (j) (X (j) ) ≤ τ .
Lastly, we need to show that the difference between ∏_{j=1}^ℓ µ′^(j) and ∏_{j=1}^ℓ µ^(j) is small.