Kneser graphs are like Swiss cheese

We prove that for a large family of product graphs, and for Kneser graphs $K(n,\alpha n)$ with fixed $\alpha<1/2$, the following holds. Any set of vertices that spans a small proportion of the edges in the graph can be made independent by removing a small proportion of the vertices of the graph. This allows us to strengthen the results of [DinurFR06] and [DinurF09], and show that any independent set in these graphs is almost contained in an independent set which depends on few coordinates. Our proof is inspired by, and follows some of the main ideas of, Fox's proof of the graph removal lemma [Fox11].


Introduction
The celebrated triangle removal lemma of Ruzsa and Szemerédi [11] has a deceptively simple formulation, yet is in fact a deep structural result. It is known to imply Roth's theorem on three term arithmetic progressions, and its generalizations to hypergraphs imply Szemerédi's theorem on arithmetic progressions. The statement is as follows.
Theorem 1 (Ruzsa-Szemerédi). For every ε > 0 there exists a δ > 0 such that if a graph on n vertices spans fewer than $\delta n^3$ triangles, it can be made triangle-free by removing at most $\varepsilon n^2$ edges.
This statement can be formulated as a statement regarding a 3-uniform hypergraph whose vertices are the edges of the complete graph on n vertices, and whose edges are the triangles. The statement then says that every set of vertices that spans few edges can be made independent by removing a small number of vertices.
In this paper we take this statement "one level down" to graphs. We show that a large family of graphs has an "edge removal phenomenon," i.e., any set of vertices that spans o(|E|) edges can be made independent by removing o(|V|) vertices. In particular, we shall prove this for product graphs (see Theorem 3.1 for the precise statement), a good example of which is $K_3^n$, in which the set of vertices is $\{0,1,2\}^n$, and two vertices span an edge if they differ in all coordinates. (Note that this is very different from the Hamming graph, where neighbors differ in precisely one coordinate.) We will also prove this for Kneser graphs K(n, k) (see Theorem 7.2), where the vertices are the subsets of size k of [n] for some 0 < k < n/2, and two vertices span an edge if they are disjoint. Our proof works as long as the ratio k/n is bounded away from zero.
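To make the two families of graphs concrete, here is a minimal brute-force sketch in Python. The helper names (`k3_adjacent`, `kneser_adjacent`, `spanned_edges`) are illustrative and not taken from the paper. It builds $K_3^3$ and the Kneser graph K(5, 2), and counts the edges spanned by a vertex set.

```python
from itertools import combinations, product

def k3_adjacent(x, y):
    """Edge in K_3^n: the two vectors differ in every coordinate."""
    return all(a != b for a, b in zip(x, y))

def kneser_adjacent(s, t):
    """Edge in K(n, k): the two k-subsets are disjoint."""
    return not (s & t)

def spanned_edges(vertices, adjacent):
    """Number of unordered edges with both endpoints in `vertices`."""
    return sum(1 for u, v in combinations(vertices, 2) if adjacent(u, v))

# K_3^3: vertices are vectors in {0,1,2}^3.
cube = list(product(range(3), repeat=3))
cube_edges = spanned_edges(cube, k3_adjacent)

# K(5, 2): vertices are the 2-subsets of {0,...,4}.
kneser = [frozenset(c) for c in combinations(range(5), 2)]
kneser_edges = spanned_edges(kneser, kneser_adjacent)

# All sets containing a fixed element form an independent set.
star = [s for s in kneser if 0 in s]
star_edges = spanned_edges(star, kneser_adjacent)
```

Brute force like this is only feasible for very small n. It is meant as a sanity check of the definitions, not as a tool for the asymptotic statements.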
So, as in Swiss cheese, although these graphs have plenty of very large holes (independent sets), there are no "uniformly sparse sets": any sparse set is nothing else than a part of one of the big independent sets, with a small perturbation.
Juntas. The context in which we encountered this problem was in trying to characterize independent sets in graph products. In $K_3^n$, for instance, the largest independent sets are those obtained by fixing a single coordinate ("dictatorships"). Other examples of independent sets that depend on few coordinates (so called "juntas") are, say, all vertices that have at least two "0" entries in their first three coordinates. Similarly, in the Kneser graph, the largest independent sets are determined by a single "coordinate", namely, all sets containing a specific element (this is part of the Erdős-Ko-Rado theorem). And, as in $K_3^n$, there exist junta-like independent sets capturing a constant proportion of the vertices, which are determined by few elements.
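The junta example above can be checked by brute force for small n. The following sketch (helper names are illustrative) verifies that the set of vertices with at least two "0" entries in their first three coordinates is independent in $K_3^5$, and compares its density with that of a dictatorship.

```python
from fractions import Fraction
from itertools import combinations, product

def differ_everywhere(x, y):
    """Adjacency in K_3^n."""
    return all(a != b for a, b in zip(x, y))

def is_independent(vertices):
    """True if no two vertices of the set are adjacent in K_3^n."""
    return not any(differ_everywhere(x, y) for x, y in combinations(vertices, 2))

n = 5
cube = list(product(range(3), repeat=n))

# Dictatorship: fix the first coordinate to 0.
dictator = [x for x in cube if x[0] == 0]

# Junta on three coordinates: at least two zeros among the first three entries.
junta = [x for x in cube if x[:3].count(0) >= 2]

dictator_density = Fraction(len(dictator), len(cube))
junta_density = Fraction(len(junta), len(cube))
```

Independence of the junta follows from pigeonhole: two vectors that each have at least two zeros among three coordinates must share a zero coordinate, so they cannot differ everywhere.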
In [3] and [2] it is proven that any independent set in these graphs can essentially be captured by a set depending on few coordinates. E.g., for every ε > 0 there exist a δ > 0 and a positive integer j such that for any independent (or very sparse) set $U \subset V(K_3^n)$ there exists a set of coordinates J ⊂ [n] with |J| ≤ j and a set $T \subset \{0,1,2\}^J$ such that all but $\varepsilon 3^n$ of the vertices in U have their J-coordinates in T. Furthermore, this set T "explains" why U is independent, because T itself is extremely sparse in the graph $K_3^J$. A similar statement was proven there for a large class of product graphs, and for the Kneser graphs.
However, there was a fly in this ointment (or a thorn in the sheep-tail-fat, as we say): we conjectured that there must exist a set T as above that is not only sparse in the product graph of dimension j, but actually independent, thus providing a complete explanation for the sparseness of U; i.e., we conjectured that any independent (or very sparse) set is almost completely contained in a set depending on few coordinates that is truly independent (as opposed to merely sparse). In this paper we manage to settle this issue and prove the conjecture. The key to this is applying the main theorem in this paper, Theorem 3.1, to the set T , which belongs to the j'th power of the base graph. We show that the edge removal phenomenon holds in the product graph, thus T can be slightly altered to produce a truly independent set (as opposed to a sparse one).
Related work. The structure of our proof is very closely modeled on Fox's proof of the graph removal lemma, [6], where he improved the longstanding bound on the dependence of the constants in the lemma (most famously, in the triangle removal lemma). One of the differences is that we have, to use Fox's terminology, a shattering lemma that is special to this setting, and is nothing else than a (minor) generalization of the main result of [3] (Theorem 2.1). It states that whenever two large sets of vertices span few edges between them, it is because there is a small set of coordinates that these two sets are strongly correlated with. This is a consequence of a central theorem in [4], which in turn, relies heavily on the invariance principle of [10].
Our main result is also reminiscent of, and related to, removal lemmas in groups, see [7], [9], and for systems of equations over finite fields, [8]. Theorem 7.2, which deals with Kneser graphs, is closely related to the work [1], where a removal lemma is proven for the special case of sets that are close in size to a maximal independent set in the Kneser graph K(n, k), for any value of k.
Structure of the paper. In Section 2 we present some preliminary definitions. In Section 3 we state our main theorem and sketch the proof. In Section 4 we reduce to a case of a "matching like" function, which will eventually lead to a substantial improvement in the resulting bounds. In Section 5 we show that any non-negative function has a good approximation (in a specific sense) by a function that depends on few coordinates. In Section 6 we complete the proof of our main theorem. In Section 7 we state and prove the Swiss Cheese Theorem regarding Kneser graphs, showing that they too exhibit an edge removal phenomenon. Finally, in the appendix we explain how to extend the main theorem from [3] to the form that we use in this paper.

Preliminaries
In the rest of the paper we fix a set V with a reversible, irreducible, aperiodic Markov chain on it given by a matrix A. All functions and constants we encounter from now on may depend on V and A. Let µ denote the unique stationary measure of A on V. We will be working with $V^n$, $A^{\otimes n}$ and $\mu^{\otimes n}$, often just writing A and µ as shorthand to avoid cumbersome notation. Whenever we take the expectation of a function on $V^n$ it is according to µ, and we also use µ to define the standard inner product between functions on $V^n$. We think of the ground set $V^n$ as the vertices of a (product) graph, and of the non-zero-probability transitions as edges. The weight of a (directed) edge (x, y) is $w_{(x,y)} := \mu(x) A_{x,y}$. This is the asymptotic probability of a step in the random walk governed by A to traverse (x, y). Equivalently, it is the probability of (u, w) = (x, y), where u is chosen by the stationary distribution, and w is chosen from u's neighbors according to the probabilities dictated by the transition matrix. We will talk about "the weight of the edges spanned" by a set U, or a function f, simply meaning $\langle 1_U, A 1_U \rangle$ or $\langle f, Af \rangle$. Consequently, a set $U \subset V^n$ is called independent if $\langle 1_U, A 1_U \rangle = 0$. We will say that a function […]
Capturing sparse sets using juntas. A crucial ingredient in our proof is the following variant on the main result from [3]. In the appendix we explain how it follows from previous work.
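As an illustration of "the weight of the edges spanned", here is a small sketch using exact rational arithmetic. It assumes that the weight of a directed edge (x, y) is µ(x) times the transition probability from x to y, which is what the probabilistic description above gives. The base chain (simple random walk on a triangle) is a hypothetical choice made only for this example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical base chain: simple random walk on the triangle, so V = {0, 1, 2}.
V = (0, 1, 2)
A = {(u, v): Fraction(0) if u == v else Fraction(1, 2) for u in V for v in V}
mu = {v: Fraction(1, 3) for v in V}  # stationary measure of this chain

def edge_weight(x, y):
    """Weight mu(x) * A(x, y) of the directed edge (x, y) in the product chain."""
    w = Fraction(1)
    for a, b in zip(x, y):
        w *= mu[a] * A[(a, b)]
    return w

def spanned_weight(U):
    """<1_U, A 1_U>: total weight of directed edges with both endpoints in U."""
    return sum(edge_weight(x, y) for x in U for y in U)

n = 2
cube = list(product(V, repeat=n))
dictator = [x for x in cube if x[0] == 0]
pair = [(0, 0), (1, 1)]

total_weight = spanned_weight(cube)        # all edge weights sum to 1
dictator_weight = spanned_weight(dictator) # an independent set spans weight 0
pair_weight = spanned_weight(pair)         # one edge, counted in both directions
```

With this base chain the product graph is exactly $K_3^n$ with the uniform measure, so a set is independent in the sense above precisely when it spans no edge of $K_3^n$.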
This easily implies the desired strengthening of [3] which was the main motivation for this work.

Sketch of proof of the main theorem
We begin the sketch for the case of sparse sets (i.e., for g having range {0, 1}), as that captures all the main ideas. We will briefly mention the extension to functions at the end of this subsection. Given a set $U \subset V^n$ which is ε-far from being independent we wish to show that it spans many edges, i.e., that $\langle 1_U, A 1_U \rangle > \delta$.
Notice that any set I ⊂ [n] naturally defines a partition of $V^n$ into $|V|^{|I|}$ parts according to the coordinates in I. We can then study how $1_U$ behaves on these parts. If it is constant (zero or one) on each part, then I perfectly captures U. Letting W be a random variable whose value is the conditional […] I. Furthermore, H is always non-positive, and is monotone increasing with respect to refining the partition induced by I by adding further coordinates. For a positive integer r, an r-refinement of a partition induced by I as above is one which involves adding r new coordinates per every part of the partition, thus adding up to $r \cdot |V|^{|I|}$ new coordinates. Given a set U, beginning with the trivial partition of $V^n$ (corresponding to $I_0 = \emptyset$) we will iteratively apply r-refinements, producing $I_0 \subset I_1 \subset \cdots$, attempting to substantially increase $H(U, I_i)$ in each step, and stop when this is no longer possible. On the one hand, since H ≤ 0, the number of steps, and hence the total number of coordinates involved in the final partition, is bounded from above by some constant k.
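The partition induced by a set of coordinates I, and the behaviour of $1_U$ on its parts, can be illustrated as follows. This is a minimal sketch assuming the uniform measure on $\{0,1,2\}^n$. The function names are illustrative, and the potential H itself is not computed here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def conditional_densities(U, V, n, I):
    """For each assignment to the coordinates in I, the fraction of that part lying in U."""
    part_size = len(V) ** (n - len(I))
    counts = Counter(tuple(x[i] for i in I) for x in U)
    return {key: Fraction(counts[key], part_size)
            for key in product(V, repeat=len(I))}

def perfectly_captures(U, V, n, I):
    """True if 1_U is constant (zero or one) on every part of the partition induced by I."""
    return all(d in (0, 1) for d in conditional_densities(U, V, n, I).values())

V, n = (0, 1, 2), 3
U = [x for x in product(V, repeat=n) if x[0] == 0]  # a dictatorship on coordinate 0

captured_by_first = perfectly_captures(U, V, n, (0,))   # coordinate 0 explains U
captured_by_second = perfectly_captures(U, V, n, (1,))  # coordinate 1 does not
densities_second = conditional_densities(U, V, n, (1,))
```

Refining a partition corresponds to enlarging I. Each part then splits into smaller parts, on which the conditional densities can only move closer to 0 or 1 on average.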
On the other hand, we will show that if U is ε-far from an independent set, and $\langle 1_U, A 1_U \rangle \le \delta$, then for any partition coming from at most k coordinates there is an r-refinement that further increases H substantially. So if δ is too small this yields a contradiction. In the proof we will sketch the exact dependence between all parameters involved (including r and k).
The crux of the proof, then, is how one can utilize the fact that U is sparse (i.e., that δ is small), U is ε-far from being independent, and |I| is not too large, in order to show the existence of an r-refinement which substantially increases H(U, I). Our engine for this is (a slight variation on) the result from [3], where the fuel of this engine is the invariance principle of [10] (as applied in [4]). Whenever two parts of the partition, say X and Y, span few edges between them, our engine will produce a refinement of the partition, according to a bounded number of new coordinates, such that on at least one of the two parts, say X, the resulting increase in H will be proportional to the measure of U ∩ X. This approach is sufficient to prove our main theorem, with a Tower(O(1/ε))-type dependence between ε and δ.
However, using a key idea from Fox's improvement to the bounds in the graph removal lemma, [6], we can cut this dependence down to Tower(O(log(1/ε))). This involves replacing the set U with a subset U′, the support of a maximal matching contained in it. The advantage of U′ is that no independent set contained in it captures more than half of its mass. As the proof will show, this helps avoid slowly whittling away at U, and speeds up the relative increase in H in every step.
One last element in the proof of our main theorem: the use of functions with range [0, 1] instead of sets. In the final part of the paper, when applying our result to sets in Kneser graphs, we will "extrapolate" these sets to [0, 1]-valued functions on $\{0,1\}^n$. Therefore it makes sense for us to make a few small adaptations in our presentation, replacing sets (which may be thought of as functions with range {0, 1}) with functions with range [0, 1]. It turns out that this natural variant is not much harder to treat than the original one.

Reducing to matching-like functions
Here we observe that it suffices to prove Theorem 3.1 in the special case that f is a matching-like function, which is a function not having too much weight on any independent set. The use of this innocuous condition, which replaces the condition of being far from independent, leads to a substantial improvement.
We say that f is matching-like if for any independent set $W \subseteq V^n$, $E[f \cdot 1_W] \le \frac{1}{2} E[f]$. The terminology comes from the observation that when µ is the uniform measure, the indicator function of the vertices touched by any matching is "matching-like," since no independent set can contain both ends of an edge.
DISCRETE ANALYSIS, 2018:2, 18pp.
The following claim shows that for any function $g : V^n \to [0,1]$ we can find a matching-like function f such that g ≥ f pointwise, and such that the set of vertices x such that f(x) < g(x) is an independent set. We note that when the measure µ is the uniform measure, and g is the indicator of a set U, then we can simply take any maximal matching inside U and let f be the indicator of its vertices. The three properties below are then easy to verify.
Claim 4.2. For any function $g : V^n \to [0,1]$ there exists a function $f : V^n \to [0,1]$ such that
1. f ≤ g pointwise,
2. f is matching-like, and
3. the set {x : f(x) < g(x)} is independent.
Proof. Note that the set of all functions f that fulfill conditions (1) and (2) is non-empty (as it contains the identically 0 function), and compact. Hence there exists some f in this set which maximizes ∑ f(x). This f must also fulfill condition (3), as otherwise we would have an edge {a, b} with f(a) < g(a) and f(b) < g(b). If such an edge were to exist, there would exist some small positive constant γ such that we could add γ/µ(a) to f(a), and γ/µ(b) to f(b), yielding a new function f′ that still fulfills condition (1) and (3). This function will also fulfill condition (2) since its expectation is greater than that of f by 2γ, but its weight on any independent set is greater by at most γ (because no independent set contains both a and b). So the existence of f′ contradicts the maximality of f.
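For the special case mentioned before the claim (uniform measure, g the indicator of a set U), the maximal-matching construction can be sketched as follows. This is an illustration on the loop-free graph $K_3^2$ with illustrative helper names. It does not cover the general weighted case handled by the compactness argument.

```python
from itertools import combinations, product

def differ_everywhere(x, y):
    """Adjacency in K_3^n."""
    return all(a != b for a, b in zip(x, y))

def maximal_matching_support(U, adjacent):
    """Greedily pick vertex-disjoint edges inside U; return the set of matched vertices."""
    matched = set()
    for u, v in combinations(U, 2):
        if u not in matched and v not in matched and adjacent(u, v):
            matched.update((u, v))
    return matched

def is_independent(S, adjacent):
    """True if no two distinct vertices of S are adjacent."""
    return not any(adjacent(u, v) for u, v in combinations(S, 2))

U = list(product(range(3), repeat=2))              # here U is all of K_3^2
support = maximal_matching_support(U, differ_everywhere)
leftover = [x for x in U if x not in support]      # where the indicators of U and of the support differ

# The unmatched vertices span no edge: otherwise the matching could be extended.
leftover_independent = is_independent(leftover, differ_everywhere)
```

Since the matching is maximal, the leftover vertices form an independent set. Since every independent set contains at most one endpoint of each matching edge, it covers at most half of the support, which is the matching-like property under the uniform measure.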
We now state a theorem quite similar to our main one, Theorem 3.1, for the special case of matching-like functions, and then, using Claim 4.2, we can easily deduce Theorem 3.1.
This now easily implies our main theorem.
Proof of Theorem 3.1. Given g we invoke Claim 4.2 to produce an appropriate matching-like f. Note that since the set U : […] Since g ≥ f pointwise it follows that $\langle g, Ag \rangle > \delta$ as required.

The potential argument
Given a function $f : V^n \to [0,1]$, and a set of coordinates I ⊆ [n], we wish to study how well f(x) is predicted by the coordinates of x indexed by I. If f depends only on the coordinates in I, then we think of f as being perfectly correlated with I, in which case f is constant on every part of the partition of $V^n$ induced by I. Otherwise, we can improve this correlation by refining the partition according to additional coordinates. In Definition 5.1 we set a (partially arbitrary) benchmark for how much this refinement improves the correlation, and define any refinement that succeeds as "substantially improving the correlation". If for every x in $V^I$ we partition $\{x\} \times V^{[n] \setminus I}$ according to r additional coordinates we call this an "r-refinement". Our main goal in this section is the proof of Lemma 5.2 that states, roughly, that for any f and r there exists a (not-too-large) set of coordinates I whose correlation with f cannot be substantially improved by r-refinement. Definition 5.1 and the statement of Lemma 5.2 are the only parts of this section used in the rest of the paper.
Just to ensure the notation in (b) is clear: […]. As mentioned, a set of coordinates I ⊂ [n] is perfectly correlated with f if for every $x \in V^I$ it holds that f(x, ·) is constant. Substantially improving the correlation of I with f by refinement means that for a portion of inputs $x \in V^I$, which are responsible for at least half of the expectation of f, it holds that by an appropriate choice of $J_x$ (at most r additional coordinates from [n] \ I), partitioning $\{x\} \times V^{[n] \setminus I}$ according to these additional coordinates yields a partition where in many parts the conditional expectation of f drops substantially. In a sense this implies that f is closer to being constant on the parts of the refined partition of $V^{[n]}$.
We first observe that this claim implies Lemma 5.2.

Improving correlation through refinements
In this section we prove Theorem 4.3, thereby completing the proof of our main theorem. It will be convenient to use the following corollary of Theorem 2.1, showing that if two functions $f_1, f_2$ span very few edges, then one of them must be concentrated on a junta of measure 3/4.
Corollary 6.1. For all 0 < ε ≤ 1/2 there exist δ > 0, r ≥ 1, such that the following holds. For all n ≥ 1 and $f_1$, […] and Pr[x ∈ T] ≤ 3/4.
Notice that $r_2$ depends on ε but not on k. Also note that the factor $w_{\min}^k$, which, as we will see in the proof, is incurred by the normalization when moving from $V^n$ to $V^{n-k}$, is the dominant factor in our calculations (as we will need k = Tower(O(log(1/ε)))).
Proof. Fix a function f and a set I ⊆ [n], with |I| ≤ k as in the statement of the lemma. We construct a set $S \subseteq V^I$ and sets $J_x$, $T_x$ satisfying the requirements in Definition 5.1 as follows. Initially, we set S = ∅. We consider all edges (including loops!) $(x_1, x_2)$ (i.e., all pairs $(x_1, x_2) \in V^I \times V^I$ with $w_{(x_1, x_2)} > 0$), in an arbitrary order. For each edge $(x_1, x_2)$, if either $x_1$ or $x_2$ is already in S, we continue to the next edge. Otherwise, we apply Corollary 6.1 to the functions $f(x_1, \cdot)$ and $f(x_2, \cdot)$ with ε taken to be ε/32, resulting in i ∈ {1, 2}, J ⊆ [n] \ I, with $|J| \le r_1(\varepsilon/32)$, and $T \subseteq V^J$. We then add $x_i$ to S, and define $J_{x_i}$ to be J and $T_{x_i}$ to be T. This completes the description of the construction. Notice that we are allowed to apply Corollary 6.1 above with parameter ε set to ε/32, since […] Moreover, notice that the complement of S forms an independent set (since S must contain at least one vertex of each edge), and since f is matching-like, we have […] Finally, for all x ∈ S we have by (9) that $\Pr[y \in T_x] \le 3/4$, and by (8) that […] We conclude that S, $\{J_x\}$, and $\{T_x\}$ satisfy the requirements in Definition 5.1, as required.
Proof of Theorem 4.3. Given a matching-like f, with E[f] = ε, let $r = r_2(\varepsilon)$ be as given by Lemma 6.2. Apply Lemma 5.2 with f and r to get a set J of cardinality at most k = k(r, ε) = Tower(O(log(1/ε))) such that the correlation of J with f cannot be substantially improved by r-refinement. From Lemma 6.2 it follows that $\langle f, Af \rangle > \delta$, with $\delta = \delta_3(\varepsilon, k) = (\mathrm{Tower}(O(\log(1/\varepsilon))))^{-1}$.
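The greedy construction of S in the proof of Lemma 6.2 above can be sketched as follows. The callback `choose` is a hypothetical stand-in for the index i returned by Corollary 6.1. The sets $J_x$ and $T_x$ are omitted, and the example graph $K_3^2$ is chosen only for illustration.

```python
from itertools import product

def greedy_cover(edges, choose):
    """Scan the edges in order; for each edge with no endpoint in S yet,
    add the endpoint selected by `choose` (a stand-in for Corollary 6.1)."""
    S = set()
    for x1, x2 in edges:
        if x1 in S or x2 in S:
            continue
        S.add(choose(x1, x2))
    return S

# Example on K_3^2: directed edges are ordered pairs differing in every coordinate.
V = list(product(range(3), repeat=2))
edges = [(x, y) for x in V for y in V if all(a != b for a, b in zip(x, y))]

S = greedy_cover(edges, choose=lambda x1, x2: x1)
rest = [x for x in V if x not in S]

# Every edge has an endpoint in S, so the vertices outside S span no edge.
covers_all_edges = all(x in S or y in S for x, y in edges)
rest_is_independent = not any((x, y) in edges for x in rest for y in rest)
```

Whatever endpoint `choose` selects, every edge ends up with at least one endpoint in S, so the vertices outside S form an independent set.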

Kneser graphs are like Swiss cheese
In this section we prove Theorem 7.2, which extends our main theorem to the case of Kneser graphs. Fix 0 < p < 1/2. We will consider the Markov chain on $\{0,1\}^n$ which moves independently on each coordinate according to the transition matrix
$$\begin{pmatrix} \frac{1-2p}{1-p} & \frac{p}{1-p} \\ 1 & 0 \end{pmatrix}$$
(rows and columns indexed by 0 and 1). The stationary measure of this Markov chain is the product measure $\mu_p = (1-p, p)^{\otimes n}$, and all transitions (x, y) have probability 0 if x and y are not disjoint. If x and y are disjoint, then the weight of the edge (x, y) is precisely $p^{|x|} p^{|y|} (1-2p)^{n-|x|-|y|}$. So, for two disjoint sets x, y ⊂ [n], we define $\mu_{p,p}(x, y) := p^{|x|} p^{|y|} (1-2p)^{n-|x|-|y|}$.
Recall the following notation. Given a set of coordinates J ⊂ [n], and two vectors $w \in \{0,1\}^J$ and $x \in \{0,1\}^{[n] \setminus J}$, we will write (w, x) for the element of $\{0,1\}^n$ formed by merging them appropriately. The following, then, is a special case of Corollary 3.2.
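The following sketch checks numerically (with exact rationals) that the stated stationary measure and edge weights fit together. It takes the single-coordinate transition matrix to have rows ((1−2p)/(1−p), p/(1−p)) and (1, 0). This explicit form is an assumption of the sketch, chosen because it is consistent with the stationary measure $(1-p, p)$ and the edge weights described above. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def step(p):
    """Assumed single-coordinate transition matrix; rows are indexed by the current state 0, 1."""
    return [[(1 - 2 * p) / (1 - p), p / (1 - p)],
            [Fraction(1), Fraction(0)]]

def edge_weight(x, y, p):
    """mu_p(x) times the transition probability from x to y in the product chain."""
    M, mu = step(p), [1 - p, p]
    w = Fraction(1)
    for a, b in zip(x, y):
        w *= mu[a] * M[a][b]
    return w

def mu_pp(x, y, p):
    """Closed form p^|x| p^|y| (1-2p)^(n-|x|-|y|) for disjoint x, y, and 0 otherwise."""
    if any(a and b for a, b in zip(x, y)):
        return Fraction(0)
    return p ** sum(x) * p ** sum(y) * (1 - 2 * p) ** (len(x) - sum(x) - sum(y))

p, n = Fraction(1, 3), 3
M, mu = step(p), [1 - p, p]
cube = list(product((0, 1), repeat=n))

# Stationarity of (1 - p, p) for one coordinate.
stationary = all(sum(mu[a] * M[a][b] for a in (0, 1)) == mu[b] for b in (0, 1))
# Agreement of the chain's edge weights with the closed form on all pairs.
weights_match = all(edge_weight(x, y, p) == mu_pp(x, y, p) for x in cube for y in cube)
```

Here a vector in $\{0,1\}^n$ is identified with the subset of [n] it indicates, so |x| is the number of ones and two vectors are disjoint when they share no coordinate equal to 1.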
We will show how to deduce Theorem 7.2 from Theorem 7.1. First, we need two lemmas, regarding moving from functions on a single layer to functions on the whole cube, and vice versa.
Proof. For any f and the corresponding g, […]
[…] where the sum is over $x \in \{0,1\}^{[n] \setminus J}$ of size precisely k − |w|. Then, for sufficiently large n, […]
Proof. Throughout this proof, when summing over x, we are restricting ourselves to the case |x| + |w| ≥ k, since other values of x contribute nothing. Observe that […] where in the last equality we reversed the order of summation. Since the first sum in Eq. (10) is precisely $V_w(f)$, it suffices to show that the second sum (which only depends on p, |J|, and |w|) is at least 1/5. To this end observe that […] tends to 1/2 by the central limit theorem. Next, for any $i < \log(n) \sqrt{np(1-p)}$ we have $\frac{((n-k)-(|J|-|w|))_i}{(n-k)_i} \sim 1$ and in particular, for sufficiently large n, and i in that range, […].
Using the up-lemma and the down-lemma we now deduce Theorem 7.2 from Theorem 7.1.
Proof of Theorem 7.2.
Let p and f be as in the statement of the theorem, let ε > 0 and assume n is sufficiently large and Edge(f) ≤ δ(ε), where δ(ε) is as defined in Theorem 7.1. Let g be as given by the up lemma, Lemma 7.3. Then Edge(g) ≤ Edge(f) ≤ δ(ε). Now, invoke Theorem 7.1 to produce J ⊂ [n] and an intersecting family $T \subset \{0,1\}^J$ which captures g, i.e., […] By the down lemma, Lemma 7.4, for every $w \in \{0,1\}^J$ (and specifically for w ∈ T) we have […] as required.

Appendix A Theorem 2.1
Theorem 2.1 is basically the main result of [3], apart from some minor differences, the most significant of which being that we improve the quantitative dependence of the parameters (namely, the functions $\delta_1$ and $j_1$) using the work of Dinur and Shinkar [5]. For the reader's convenience, we include a proof sketch in Section A.1. Alternatively, we now explain how to derive Theorem 2.1 from the original statement in [3], which now follows.
Theorem A.1 ([3, Theorem 1.1 + 2nd and 4th remarks there]). For all ε > 0 there exist δ > 0, j ≥ 1, such that the following holds. For all n ≥ 1 and f : […]
The differences between this and our Theorem 2.1 are as follows. First, our theorem considers functions with range [0, 1] as opposed to {0, 1}. The proof in [3] actually applies to the more general case, as is easy to check. Alternatively, one can derive the more general case from the restricted one by replacing a function $f : V^n \to [0,1]$ with the function $f' : V^{n+m} \to \{0,1\}$ where we define f′(x, y) to be 1 with probability f(x) and 0 otherwise, independently over all x, y. Then as m goes to infinity, $\langle f', Af' \rangle$ converges to $\langle f, Af \rangle$ and similarly for the other expressions appearing in the theorem.
A second difference is that our theorem involves two functions $f_1, f_2$ as opposed to just one as above. The proof in [3] can easily be modified to handle this. Alternatively, as before, we can derive this from the original statement as follows. Let $a_1, a_2$ be two elements of $V^2$ that are connected by an edge and have no self loops. (Such two elements must exist unless we are in the case in which there is a loop on all vertices in V, which means there are no non-empty independent sets in any power of V, so this case is irrelevant for our current discussion.) Then given $f_1$, […] $a = a_2$, and 0 otherwise. Then $\langle f, Af \rangle = 2w \langle f_1, A f_2 \rangle$ where w is the weight of the edge connecting $a_1$ to $a_2$.
The final and most significant difference is that Theorem A.1 does not explicitly specify the dependence of δ and j on ε. Inspecting the proof in [3] reveals that the dependence is superpolynomial. By using an improvement by Dinur and Shinkar [5] of the technical statement from [4], we are able to obtain a polynomial dependence of the parameters, as stated in Theorem 2.1. We remark that this improvement has no noticeable effect on the final bound in our main result and we could have used the original bound implicit in [3]; we decided to include the improvement as it might be useful for future work.
In slightly more detail, the parameters in the proof of Theorem 1.1 in [3] all depend polynomially on the functions $\tau_{\mathrm{MOO}}$ and $\delta_{\mathrm{MOO}}$ defined in Theorem 2.2 there. Those functions can be taken to be polynomial, as shown in the following lemma.

A.1 Proof of Theorem 2.1
Here we include a proof sketch of Theorem 2.1, closely following the original proof in [3] and occasionally borrowing from the notation there.
In particular, the number of variables that have influence at least τ on $N_\eta f$ is at most $(1-\eta^2)^{-2}/\tau$. […] Let λ = λ(A) < 1 be the second absolute eigenvalue of A, and let 1 − λ < η < 1 be sufficiently close to 1 so that […] Then for any $f_1$, […]
Proof. By decomposing the functions according to the eigenbasis of A, […] A straightforward calculation (see [3, Lemma 2.5]) shows that the above maximum is at most $\sqrt{1-\eta}$. We can therefore complete the proof by noting, using Cauchy-Schwarz, that $\sum_S |\hat{f}_1(S) \hat{f}_2(S)| \le \|f_1\|_2 \|f_2\|_2 \le 1$.
For i ∈ {1, 2}, define $g_i = N_\eta f_i$. Let j be the number of variables with influence greater than γ on either $g_1$ or $g_2$, and assume without loss of generality that these are the variables J := {1, . . . , j}. By Claim A.3 there are at most $2(1-\eta^2)^{-2}/\gamma$ such variables, so in particular, we can take $j = \varepsilon^{-c}$ for large enough c, as required.
For $a \in V^j$ define $g_{1,a} : V^{n-j} \to [0,1]$ by $g_{1,a}(x) = g_1(a, x)$ and similarly for $g_2$. Let […] and similarly define $T_2$ with $g_{2,a}$. The condition in Eq. (1) now follows from Lemma A.6.
From this it follows that i has influence greater than γ on either $g_1$ or on $g_2$, in contradiction to the definition of J.