A very sharp threshold for first order logic distinguishability of random graphs
- Itai Benjamini, Department of Mathematics, Weizmann Institute of Science
- Maksim Zhukovskii, Department of Computer Science, University of Sheffield
Editorial introduction
A very sharp threshold for first order logic distinguishability of random graphs, Discrete Analysis 2025:6, 17 pp.
Let \(G_1\) and \(G_2\) be two independent random graphs with \(n\) vertices, each obtained by joining every pair of vertices independently with probability 1/2. The probability that they are isomorphic is very small, but they nevertheless share many important properties: to give just one of many possible examples, with high probability they will contain roughly the same number of triangles.
There are certainly some properties that they have a significant probability of not sharing. For instance, with probability 1/2, one of them will have an even number of edges and the other an odd number of edges. However, this property of graphs is quite complicated in that it depends on every single edge (or more precisely, for each pair \(xy\) of vertices the property depends on whether \(xy\) is an edge). Are there properties that depend on far fewer variables but that can still distinguish between \(G_1\) and \(G_2\) with reasonable probability?
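(To see that this probability is exactly 1/2: conditioned on the statuses of all the other pairs, any one fixed pair is an edge with probability 1/2, so the parity of the number of edges of each \(G_i\) is uniform, and the two parities are independent. Hence
\[\Pr[\text{exactly one of } G_1,G_2 \text{ has an even number of edges}]\ =\ 2\cdot\tfrac12\cdot\tfrac12\ =\ \tfrac12.)\]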
The usual way to formulate this question precisely is to ask whether there is a property \(P\) given by a first-order formula in the language of graphs (roughly speaking, that means that the atomic formulae can use equality and the “is an edge” relation, and the formula as a whole is built up from atomic formulae using connectives and quantification) that depends on only a small number of variables, with the property that the probability that \(P(G_1)\) holds and \(P(G_2)\) does not hold is bounded away from 0.
Because first-order quantification produces statements that must hold either for all vertices or for at least one vertex, it is hard to avoid properties \(P\) that are true with probability close to 0 or 1. For example, the property
\[\forall x\ \exists y\ \exists z\ \big(xy\in E(G)\wedge yz\in E(G)\wedge xz\notin E(G)\big),\]
which states that every vertex is an end vertex of an induced path of length 2, holds with probability \(1-o(1)\). To see this, note that for each \(x,y,z\), the probability that \(yz\in E(G)\) and \(xz\notin E(G)\) is 1/4, so for each \(x\) and \(y\), the probability that there exists \(z\) with \(yz\in E(G)\) and \(xz\notin E(G)\) is \(1-(3/4)^{n-2}\). Similarly, for each \(x\), the probability that there exists \(y\) with \(xy\in E(G)\) is \(1-(1/2)^{n-1}\). Since these probabilities are exponentially close to 1, the probability that such \(y\) and \(z\) exist for every \(x\) is also exponentially close to 1.
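One way to make this last step precise is to note that, for a fixed \(x\), one can choose \(\lfloor(n-1)/2\rfloor\) disjoint pairs \(\{y_i,z_i\}\) from the remaining vertices; the events \(xy_i\in E(G)\wedge y_iz_i\in E(G)\wedge xz_i\notin E(G)\) then involve disjoint sets of potential edges, so they are independent, and each has probability \(1/8\). A union bound over the \(n\) choices of \(x\) therefore gives
\[\Pr[\text{some vertex } x \text{ has no suitable } y,z]\ \le\ n\left(\tfrac{7}{8}\right)^{\lfloor(n-1)/2\rfloor},\]
which is indeed exponentially small.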
A development of this argument can be used to show that every fixed first-order formula satisfies a zero-one law, in the sense that it holds with probability \(o(1)\) or \(1-o(1)\).
This paper concerns the point at which the argument breaks down, in the following sense. If the formula is not fixed, but instead the number of variables is allowed to depend on \(n\), how many variables are needed for a property \(P\) such that the probability that \(P(G_1)\) holds and \(P(G_2)\) does not hold is bounded away from 0, or, better still, is \(1/4-o(1)\)?
It is relatively straightforward to see that the bound one would expect here is logarithmic. In one direction, if \(k\) is the number of variables, then in order for probabilities such as \((1-2^{-k})^n\) not to be exponentially small, we need \(2^{-k}\) to be comparable to \(1/n\), so we need \(k\) to be at least logarithmic (the arithmetic is spelled out after the example below). And in the other direction, the value of \(k\) for which the formula
\[\exists x_1,\dots,x_k\ \bigwedge_{i\ne j} (x_ix_j\in E(G))\]
holds with probability approximately 1/2 is logarithmic in \(n\). (The formula of course states that \(G\) contains a clique of size \(k\).)
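To spell out the arithmetic behind both directions: on the one hand,
\[(1-2^{-k})^n=\exp\!\big(n\log(1-2^{-k})\big)\approx e^{-n2^{-k}},\]
which is bounded away from both 0 and 1 precisely when \(n2^{-k}=\Theta(1)\), that is, when \(k=\log_2 n+O(1)\). On the other hand, the expected number of \(k\)-cliques in \(G\) is
\[\binom{n}{k}2^{-\binom{k}{2}},\]
and since \(\binom{n}{k}\le n^k\), this drops below 1 once \(\binom{k}{2}\) exceeds \(k\log_2 n\), that is, around \(k\approx 2\log_2 n\), which is where the clique formula stops being likely to hold.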
The paper proves a much more precise result than this, obtaining the right bound for \(k\) up to an additive constant. In fact, it narrows down the right value of \(k\) to just four integers, namely \(\hat k+1, \hat k+2, \hat k+3\) and \(\hat k+4\), where \(\hat k\) is roughly the integer part of \(\log_2n-2\log_2(\log n)+\log_2(\log 2)\). Previously, the best known upper bound for general \(n\) was of the form \(\hat k+O(\log\log n)\), though in 2003 Kim, Pikhurko, Spencer and Verbitsky obtained an upper bound of the form \(\hat k+O(1)\) that applies to infinitely many \(n\).
The fact that \(k\) must be at least \(\hat k+1\) follows from a simple argument similar to the one above for induced paths of length 2, so the content of this result is the upper bound. To obtain an upper bound, one must find a formula in \(k\) variables that holds with probability close to 1/2. (The authors are aiming for the stronger statement where the probability of the event \(P(G_1)\wedge\neg P(G_2)\) is close to 1/4.)
The proof makes heavy use of the second-moment method. It is delicate, because the construction of the formula depends in an important way on whether \(\log_2n-2\log_2(\log n)+\log_2(\log 2)\) is or is not close to an integer. In the simplest case, the authors define a certain monotone decreasing family \(\mathcal F\) of graphs, and \(P(G)\) holds if and only if \(G\) has an induced subgraph \(H\) that is isomorphic to a graph in \(\mathcal F\), together with a vertex \(x\) that is joined to all the vertices in \(H\).
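Recall the inequality at the heart of the second-moment method: if \(X\) is a non-negative integer random variable, for instance a count of witnesses such as the pairs \((H,x)\) just described, then Chebyshev's inequality gives
\[\Pr[X=0]\ \le\ \Pr\big[|X-\mathbb{E}X|\ge\mathbb{E}X\big]\ \le\ \frac{\operatorname{Var}X}{(\mathbb{E}X)^2}.\]
Roughly speaking, the task is then to tune the family \(\mathcal F\) so that \(\mathbb{E}X\) has the right order of magnitude while \(\operatorname{Var}X\) stays under control.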