Self-similarity in the circular unitary ensemble

This paper gives a rigorous proof of a conjectured statistical self-similarity property of the eigenvalues of random matrices from the Circular Unitary Ensemble. We consider on the one hand the eigenvalues of an $n \times n$ CUE matrix, and on the other hand those eigenvalues $e^{i\phi}$ of an $mn \times mn$ CUE matrix with $|\phi| \le \pi / m$, rescaled to fill the unit circle. We show that for a large range of mesoscopic scales, these collections of points are statistically indistinguishable for large $n$. The proof is based on a comparison theorem for determinantal point processes which may be of independent interest.


Introduction
The set of N × N unitary matrices is a compact Lie group, and as such, possesses a unique probability measure which is invariant under left- and right-translation (called Haar measure). In random matrix theory, the unitary group together with Haar probability measure is called the circular unitary ensemble (CUE). The word circular refers to the fact that all of the eigenvalues of a CUE matrix lie on the unit circle in the complex plane.
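To make the definition concrete, the following is a minimal sketch of one common way to draw a CUE matrix numerically: take the QR factorization of a complex Gaussian matrix and correct the column phases. This recipe is a standard one and is not taken from the paper; the function names `sample_cue` and `eigenangles` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def sample_cue(n, rng):
    """Draw an n x n Haar-distributed unitary matrix (one CUE sample)."""
    # Complex Gaussian matrix with i.i.d. standard complex normal entries.
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    # Multiply column j of q by the phase of r[j, j]; without this
    # correction the output of QR is not Haar distributed.
    d = np.diagonal(r)
    return q * (d / np.abs(d))

def eigenangles(u):
    """Sorted eigenvalue arguments of u, as returned by np.angle."""
    return np.sort(np.angle(np.linalg.eigvals(u)))

rng = np.random.default_rng(0)
u = sample_cue(8, rng)
angles = eigenangles(u)
# Every eigenvalue should have modulus 1 up to rounding error.
print(np.max(np.abs(np.abs(np.linalg.eigvals(u)) - 1.0)))
```

The seed and the matrix size 8 are arbitrary illustrative choices.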
There has long been a folklore conjecture that the distribution of the eigenvalues of a CUE random matrix has a self-similar structure. For example, in their statistical analysis [3] of CUE eigenvalues and zeroes of the Riemann zeta function, Coram and Diaconis hypothesized that the following may hold:
Conjecture 1. Let $U$ be an $N \times N$ random matrix from the CUE with eigenvalues $\{e^{i\theta_j}\}_{1 \le j \le N}$, where $0 \le \theta_1 \le \cdots \le \theta_N < 2\pi$. Choose an eigenvalue $e^{i\theta_K}$ uniformly, and let $T$ be the length of the counterclockwise circular arc from $\theta_K$ to $\theta_{K+k}$, where the indices are interpreted modulo $N$. Let $\phi \in [0, 2\pi)$ be a uniformly chosen random angle, independent of $U$. If $k$ and $N$ are both large, then the random set of points $\{e^{i(\phi + 2\pi\theta_j/T)}\}_{K \le j < K+k}$ is statistically indistinguishable from the eigenvalues of a $k \times k$ random matrix from the CUE.
That is, a random choice of k sequential eigenvalues of an N × N CUE matrix U, rescaled and randomly rotated, is indistinguishable from the full set of eigenvalues of a k × k random matrix.
Aside from statistical evidence for the conjecture, there is a result of E. Rains [15] which is suggestive of this kind of self-similarity. Suppose that $U$ is an $N \times N$ random CUE matrix, with $N = nk$; Rains proved that the distribution of the eigenvalues of $U^n$ is exactly that of the collection of eigenvalues of $n$ independent $k \times k$ random CUE matrices. That is, wrapping the eigenvalues of $U$ around the circle $n$ times produces $n$ independent copies of the $k$ eigenvalues of a $k \times k$ random matrix. It is tempting to view each of those collections of $k$ eigenvalues as coming from one of the $n$ arcs of the circle that gets stretched to cover the circle once (this is not at all the way Rains' theorem is actually proved). If this intuition were correct, it would illustrate exactly the kind of self-similarity conjectured by Coram and Diaconis.
In this paper, we give a rigorous proof of a version of the self-similarity conjecture. The following notation is used throughout. Let $U$ be an $n \times n$ random CUE matrix with eigenvalues $\{e^{i\theta_j}\}_{1 \le j \le n}$, with $\theta_j \in [-\pi, \pi)$ for each $j$. (It is a matter of technical convenience to take the arguments of the eigenvalues to be in $[-\pi, \pi)$ here instead of in $[0, 2\pi)$ as in Conjecture 1.) For $A \subseteq [-\pi, \pi)$, $N_{n,A}$ denotes the number of eigenangles $\theta_j$ which lie in $A$; we generally omit the $n$ and write $N_A$. For $\theta \in [0, \pi)$, $N_{[-\theta,\theta]}$ is denoted by $N_\theta$. For $m \ge 1$, let $U^{(m)}$ be an $nm \times nm$ random CUE matrix with eigenvalues $\{e^{i\phi_j}\}_{1 \le j \le nm}$, with $\phi_j \in [-\pi, \pi)$ for each $j$, and let $N^{(m)}_{n,A}$, or simply $N^{(m)}_A$, denote the number of eigenangles $\phi_j$ with $m\phi_j \in A$. That is, $N^{(m)}_A$ counts the random points in $A$ of the point process obtained by taking the eigenvalues of $U^{(m)}$ in the arc of length $\frac{2\pi}{m}$ about 1 and rescaling them to fill out the whole circle. While the total number of eigenvalues in this arc is random, it concentrates strongly at its expected value of $n$. In the context of the Diaconis-Coram conjecture, our $nm$ plays the role of $N$ and $n$ plays the role of $k$.
For context, recall that $\mathbb{E} N_A = \frac{n|A|}{2\pi}$; the same is true for $N^{(m)}_A$. In the statement of Theorem 2, and all of the following results, precise constants are included for concreteness, with no claims as to their sharpness. The definitions of $d_{TV}(\cdot, \cdot)$ and $W_1(\cdot, \cdot)$ are recalled at the end of this section.
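The two counting functions can be computed directly from lists of eigenangles. The sketch below is only an illustration of the definitions: the helper names are hypothetical, the sets are restricted to half-open arcs $[a, b)$, and the eigenangles are made-up numbers rather than samples from the CUE.

```python
import math

def count_in_arc(angles, a, b):
    """N_A for the arc A = [a, b): number of eigenangles theta_j in A."""
    return sum(1 for t in angles if a <= t < b)

def rescaled_count(angles_big, m, a, b):
    """N^(m)_A for A = [a, b): keep the eigenangles phi_j of the nm x nm
    matrix lying in [-pi/m, pi/m), multiply them by m, and count those in A."""
    kept = [m * p for p in angles_big if -math.pi / m <= p < math.pi / m]
    return count_in_arc(kept, a, b)

def expected_count(n, a, b):
    """E N_A = n |A| / (2 pi) for the arc A = [a, b)."""
    return n * (b - a) / (2 * math.pi)

# Hypothetical eigenangles, chosen by hand purely for illustration.
small = [-2.0, -0.3, 0.1, 0.4, 2.5]                              # n = 5
big = [-3.0, -1.2, -0.2, -0.05, 0.0, 0.1, 0.3, 0.9, 2.0, 3.1]   # nm = 10, m = 2

print(count_in_arc(small, -0.5, 0.5))      # N_A
print(rescaled_count(big, 2, -0.5, 0.5))   # N^(2)_A
print(expected_count(5, -0.5, 0.5))        # common expected value
```

With real CUE samples in place of the hand-picked lists, repeated draws of the two counts could be compared empirically.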
Theorem 2 was stated with the implicit assumption that $m$ is an integer, since it is in that case that it relates directly to Conjecture 1. However, it is only strictly necessary that $mn$ is an integer, and a slight refinement of the proof shows that for any $m \ge 1$ with $mn \in \mathbb{N}$, a refined bound, referred to below as (1), holds. In particular, this yields the comparison between $n \times n$ CUE eigenvalues and $(n + 1) \times (n + 1)$ CUE eigenvalues.
As a consequence of Theorem 2, if $\{A_n\}$ is a sequence of sets such that either $\operatorname{diam} A_n = o(n^{-1/4})$ or $|A_n| = o(n^{-1/2})$ as $n \to \infty$, then
$$d_{TV}\bigl(N_{A_n}, N^{(m)}_{A_n}\bigr) \le W_1\bigl(N_{A_n}, N^{(m)}_{A_n}\bigr) \to 0.$$
Thus indeed, a sequential arc of about $n$ of the $nm$ eigenvalues of an $nm \times nm$ random matrix is statistically indistinguishable, on the scale of $o(n^{-1/4})$ for diameter or $o(n^{-1/2})$ for Lebesgue measure, from the $n$ eigenvalues of an $n \times n$ random matrix.
A remarkable feature of Theorem 2 is that it yields microscopic information even at a mesoscopic scale: if $\frac{1}{n} \ll \operatorname{diam} A_n \ll \frac{1}{n^{1/4}}$, then $N_{A_n}$ and $N^{(m)}_{A_n}$ both have expectations and variances tending to infinity (as follows from Lemma 7 below). One would thus typically try to understand the point processes at these scales by studying statistical properties of the recentered and rescaled counts, rather than try to observe individual points. Here, we are able to make direct point-by-point comparisons of the two point processes treated as discrete objects, with no rescaling or continuous approximations.
The fact that we are able to compare the two point processes with no rescaling certainly suggests that we are witnessing a true self-similarity phenomenon which is a special feature of the structure of the eigenvalues of CUE random matrices. However, one should be careful to check that the two point processes are not similar simply because they have the same limit. Indeed, Wieand [19] and Soshnikov [17] showed that $N_\theta$, suitably centered and normalized, converges weakly to a standard Gaussian random variable as $n \to \infty$ for fixed $\theta$; the same then follows for $N^{(m)}_\theta$. Figure 1 gives a convincing visual illustration that $N_A$ and $N^{(m)}_A$ resemble each other more closely than either resembles a Gaussian distribution; a rigorous proof of this fact is given in Proposition 8 below.
We conjecture that a result comparable to Theorem 2 holds without the restriction on $\operatorname{diam} A$, and that the factor of $\sqrt{n}$ on the right-hand side is an artifact of our proof; this would imply that $N_{A_n}$ and $N^{(m)}_{A_n}$ become indistinguishable as long as $|A_n| \to 0$. For more details, see the remark at the end of Section 2. On the other hand, we do not expect such a result to hold for sets of constant size, i.e., independent of $n$. For example, Rains [14] gives precise asymptotics for $\operatorname{Var} N_\theta$ for $n \to \infty$ and $\theta$ fixed, which show that $\operatorname{Var} N_\theta$ and $\operatorname{Var} N^{(m)}_\theta$ are not asymptotically equal in that regime. That is, Proposition 9 below on the asymptotic equality of variances does not extend to sets of constant size, and we do not expect Theorem 2 to hold in this setting either.
[Figure 1. Caption only partly recoverable: the two counts shown have means $\frac{20}{\pi} \approx 6.4$ and $\frac{125}{\pi} \approx 39.8$, respectively, and the comparison distribution has variance equal to the average of the two corresponding sample variances.]
We expect that a version of Theorem 2 holds for the other circular ensembles of random matrix theory; however, our approach is via the determinantal structure of the eigenvalue process for the CUE, which is not present outside the unitary case.
Finally, some comments on the relationship between Conjecture 1 and Theorem 2 are in order. The models of self-similarity being used are not identical; in Conjecture 1, exactly $k + 1$ sequential eigenvalues are selected and stretched as needed to make the first and last meet, resulting in exactly $k$ random points. In Theorem 2, the eigenvalues from an arc making up a fixed fraction of the circle are chosen and that arc is stretched (deterministically) to cover the whole circle; the resulting total number of points is random. However, in the mesoscopic regime the two models are essentially the same. The idea is the following: eigenvalue rigidity (see Lemma 10 of [12]) implies that the difference between the $j$th and the $(j + n)$th eigenangles of an $nm \times nm$ CUE matrix is about $\frac{2\pi}{m} + O\left(\frac{\sqrt{\log n}}{n}\right)$ with high probability. So whereas Theorem 2 considers the eigenangles of an $nm \times nm$ matrix in an interval of length $\theta/m$, Conjecture 1 suggests considering the eigenangles in an interval whose length is random but typically log n , then with extremely high probability an interval of length $\frac{\theta\sqrt{\log n}}{n}$ contains no eigenangles, and so the corresponding counts are the same.
The rest of this paper is organized as follows. In Section 2 we give the background and general results on determinantal point processes needed to prove Theorem 2, followed by the proof of the theorem and a corollary giving a rate for the classical convergence of the eigenvalue process to the sine kernel process on a microscopic scale. In Section 3 we give precise asymptotics for the variances of the counting functions. As a consequence, we are able to identify a sharp rate of convergence in the central limit theorem mentioned above, which is in particular much slower than the merging of distributions in Theorem 2. We also show that the variances of the counting functions of the two processes are asymptotically equal throughout the entire mesoscopic regime, giving a rigorous proof of another manifestation of the self-similarity phenomenon. Finally, Section 4 gives a surprising comparison between the joint intensities of the eigenvalue processes for $U$ and $U^{(m)}$.
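We conclude this section with a brief review of the notions of distance used here. The following distances can be defined much more generally, but for our purposes, it suffices to define them for integer-valued random variables $X$ and $Y$. The total variation distance is
$$d_{TV}(X, Y) = \sup_{A \subseteq \mathbb{Z}} \bigl| \mathbb{P}[X \in A] - \mathbb{P}[Y \in A] \bigr|,$$
and the $L^1$-Wasserstein distance is
$$W_1(X, Y) = \inf_{(Z_1, Z_2)} \mathbb{E} |Z_1 - Z_2|,$$
where the infimum is over random vectors $(Z_1, Z_2)$ such that $Z_1$ has the same distribution as $X$ and $Z_2$ has the same distribution as $Y$ (such a random vector is called a coupling of $X$ and $Y$).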
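For integer-valued random variables, both distances can be computed directly from probability mass functions. The following sketch assumes the usual definitions (total variation as a supremum over sets of integers, and $W_1$ as the infimum of $\mathbb{E}|Z_1 - Z_2|$ over couplings) and uses two standard facts: the supremum is half the $\ell^1$ distance between the mass functions, and on the integers $W_1$ equals the sum over $k$ of the absolute differences of the cumulative distribution functions. The function names and the example distributions are hypothetical.

```python
def total_variation(p, q):
    """d_TV(X, Y) = sup_A |P(X in A) - P(Y in A)| for integer-valued X, Y.
    p and q map integers to probabilities; the supremum is attained at
    A = {k : p(k) > q(k)}, which gives half the l^1 distance."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def wasserstein_1(p, q):
    """W_1(X, Y) on the integers, as the sum over k of |F_X(k) - F_Y(k)|."""
    support = set(p) | set(q)
    lo, hi = min(support), max(support)
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for k in range(lo, hi):  # both CDFs equal 1 at hi, so that term is zero
        cdf_p += p.get(k, 0.0)
        cdf_q += q.get(k, 0.0)
        total += abs(cdf_p - cdf_q)
    return total

# Two hypothetical distributions on {0, 1, 2}.
p = {0: 0.5, 1: 0.5}
q = {0: 0.25, 1: 0.5, 2: 0.25}
tv = total_variation(p, q)
w1 = wasserstein_1(p, q)
print(tv, w1)
```

In this example the total variation distance does not exceed $W_1$, in line with the general comparison between the two distances for integer-valued random variables.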
The Kantorovich-Rubinstein Theorem states that $W_1$ can equivalently be defined as
$$W_1(X, Y) = \sup_f \bigl| \mathbb{E} f(X) - \mathbb{E} f(Y) \bigr|,$$
where the supremum is over 1-Lipschitz functions $f : \mathbb{Z} \to \mathbb{R}$. The distance $W_1$ is a metric for the topology of weak convergence plus convergence of absolute first moments. (See [18, Section 6] for a thorough discussion and proofs.) Note that an indicator function of a set $A$ of integers is 1-Lipschitz on $\mathbb{Z}$, and so for $X$ and $Y$ integer-valued,
$$d_{TV}(X, Y) \le W_1(X, Y). \qquad (2)$$

Determinantal point processes and the proof of Theorem 2
Let $\Lambda$ be a locally compact Polish space. A simple point process on $\Lambda$ is a random integer-valued (positive) Radon measure $\chi$ on $\Lambda$, such that the measure of any singleton is at most 1. Alternatively, it may be viewed as a locally finite random set of points in $\Lambda$; if $A \subseteq \Lambda$ then we write $N_A = \chi(A)$ for the (random) number of points lying in $A$. If $\Lambda$ is equipped with a reference Borel measure $\mu$, then the $k$th joint intensity or correlation function $\rho_k : \Lambda^k \to [0, \infty)$ of $\chi$ is defined by the equation
$$\mathbb{E}\left[\prod_{j=1}^k N_{A_j}\right] = \int_{A_1} \cdots \int_{A_k} \rho_k(x_1, \dots, x_k) \, d\mu(x_1) \cdots d\mu(x_k)$$
whenever $A_1, \dots, A_k \subseteq \Lambda$ are measurable and pairwise disjoint, assuming that such functions exist. A simple point process is called a determinantal point process with kernel $K : \Lambda^2 \to \mathbb{C}$ if its joint intensities exist and
$$\rho_k(x_1, \dots, x_k) = \det\bigl[K(x_i, x_j)\bigr]_{i,j=1}^k.$$
Note that it is immediate from the definition that the restriction of a determinantal point process on $\Lambda$ to a measurable subset $D \subseteq \Lambda$ is again a determinantal point process.
A kernel $K : \Lambda^2 \to \mathbb{C}$ defines an integral operator $\mathcal{K}$ on $L^2(\mu)$ by
$$\mathcal{K} f(x) = \int_\Lambda K(x, y) f(y) \, d\mu(y);$$
if $K(x, y) = \overline{K(y, x)}$, then the operator $\mathcal{K}$ is self-adjoint. It was proved by Macchi [11] and Soshnikov [16] that a kernel $K$ which defines a self-adjoint, trace class operator $\mathcal{K}$ as above is the kernel of a determinantal point process if and only if all of the eigenvalues of $\mathcal{K}$ lie in $[0, 1]$.
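For the remainder of this paper, $\chi$ will denote the point process of eigenvalue angles in $[-\pi, \pi)$ of an $n \times n$ CUE random matrix. For fixed $m \ge 1$, let $\chi^{(m)}$ denote the point process obtained by multiplying by $m$ those eigenvalue angles of an $nm \times nm$ CUE random matrix which lie in $\left[-\frac{\pi}{m}, \frac{\pi}{m}\right)$. It is a fact originally due to Dyson that $\chi$ is a determinantal point process on $[-\pi, \pi)$; it follows easily that $\chi^{(m)}$ is as well. The following proposition gives explicit formulae for the corresponding kernels.
Proof. The case $m = 1$ was proved by Dyson in [5] (although that work predates the language of determinantal point processes); see also [13, Section 11.1] or [10, Section 5.4]. The general case follows from a change of variables which shows that $K^{(m)}_n(x, y) = \frac{1}{m} K_{mn}\left(\frac{x}{m}, \frac{y}{m}\right)$.
Proof of Proposition 4. By (2), it suffices to prove the second inequality. Let $\{\lambda_j\}$ and $\{\tilde{\lambda}_j\}$ be the eigenvalues, listed in nonincreasing order, of the integral operators $\mathcal{K}$ and $\tilde{\mathcal{K}}$ with kernels $K$ and $\tilde{K}$ respectively. Since $\mathcal{N}, \tilde{\mathcal{N}} \le N$, by Lemma 5, $\lambda_j = \tilde{\lambda}_j = 0$ for $j > N$. Let $\{Y_j\}_{j=1}^N$ be independent random variables uniformly distributed in $[0, 1]$. For each $j$, define
$$\xi_j = \mathbf{1}_{Y_j \le \lambda_j} \quad \text{and} \quad \tilde{\xi}_j = \mathbf{1}_{Y_j \le \tilde{\lambda}_j}.$$
Through Lemma 5, this gives a coupling of $\mathcal{N}$ and $\tilde{\mathcal{N}}$, and so
$$W_1(\mathcal{N}, \tilde{\mathcal{N}}) \le \mathbb{E}\left|\sum_{j=1}^N \xi_j - \sum_{j=1}^N \tilde{\xi}_j\right| \le \sum_{j=1}^N |\lambda_j - \tilde{\lambda}_j| \le \sqrt{N} \left(\sum_{j=1}^N |\lambda_j - \tilde{\lambda}_j|^2\right)^{1/2}. \qquad (4)$$
By the Hoffman-Wielandt inequality [9, Theorem II.6.11],
$$\left(\sum_{j=1}^N |\lambda_j - \tilde{\lambda}_j|^2\right)^{1/2} \le \|\mathcal{K} - \tilde{\mathcal{K}}\|_{H.S.},$$
where $\|\cdot\|_{H.S.}$ denotes the Hilbert-Schmidt norm. The result now follows from the general fact that the Hilbert-Schmidt norm of an integral operator on $L^2(\mu)$ is given by the $L^2(\mu \otimes \mu)$ norm of its kernel (see e.g. [20, p. 245]).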
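The coupling idea in the proof of Proposition 4 can be tried out numerically. The sketch below assumes that each count is distributed as a sum of independent Bernoulli variables whose parameters are the eigenvalues of the corresponding operator, and that the two sums are coupled by thresholding the same uniform variables $Y_j$; the specific indicator construction, the function names, and the eigenvalue lists are illustrative assumptions rather than data from the paper.

```python
import math
import random

def coupled_counts(lams, lams_tilde, rng):
    """One draw of the coupled pair: both sums of Bernoulli indicators
    are driven by the same uniform variables Y_j."""
    ys = [rng.random() for _ in lams]
    n = sum(1 for y, lam in zip(ys, lams) if y <= lam)
    n_tilde = sum(1 for y, lam in zip(ys, lams_tilde) if y <= lam)
    return n, n_tilde

def l1_bound(lams, lams_tilde):
    """Sum of |lambda_j - lambda~_j|, which bounds E|N - N~| under the coupling."""
    return sum(abs(a - b) for a, b in zip(lams, lams_tilde))

def cauchy_schwarz_bound(lams, lams_tilde):
    """sqrt(N) times the l^2 distance between the eigenvalue lists."""
    sq = sum((a - b) ** 2 for a, b in zip(lams, lams_tilde))
    return math.sqrt(len(lams) * sq)

# Hypothetical eigenvalue lists in [0, 1], in nonincreasing order.
lams = [0.95, 0.80, 0.50, 0.20, 0.05]
lams_tilde = [0.90, 0.85, 0.40, 0.25, 0.00]

rng = random.Random(0)
draws = [coupled_counts(lams, lams_tilde, rng) for _ in range(20000)]
mean_gap = sum(abs(a - b) for a, b in draws) / len(draws)
print(mean_gap, l1_bound(lams, lams_tilde), cauchy_schwarz_bound(lams, lams_tilde))
```

The empirical mean of the gap between the coupled counts should sit at or below the $\ell^1$ bound up to sampling error, and the Cauchy-Schwarz bound is larger still; that last step is where a dimensional factor enters.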
We are now in a position to prove the main theorem.

Thus by Propositions 3 and 4,
The refinement (1) of Theorem 2 follows by using a higher-order Taylor expansion in (5). Both $N_A$ and $N^{(m)}_A$ satisfy central limit theorems in the mesoscopic regime (see Proposition 8 and the remark which follows). We show in the next section that Theorem 2 does indeed describe a non-trivial self-similarity phenomenon on a mesoscopic level, which is not the result of both processes having the same limit.
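Since the proof compares the two point processes through their kernels, it can be helpful to look at the kernels numerically. The sketch below assumes the standard sine-ratio form of Dyson's kernel for the CUE with respect to the normalized measure $dx/2\pi$, and obtains the kernel of the rescaled process by the change of variables $x \mapsto x/m$; both formulas are stated here as assumptions, and the function names are hypothetical.

```python
import math

def kernel(n, x, y):
    """Assumed CUE kernel K_n(x, y) = sin(n(x-y)/2) / sin((x-y)/2) with
    respect to dx / (2 pi), for |x - y| < 2 pi; the diagonal value is n."""
    u = x - y
    s = math.sin(u / 2)
    return float(n) if abs(s) < 1e-12 else math.sin(n * u / 2) / s

def kernel_rescaled(n, m, x, y):
    """Kernel of the rescaled process: (1/m) K_{nm}(x/m, y/m), which equals
    sin(n(x-y)/2) / (m sin((x-y)/(2m)))."""
    return kernel(n * m, x / m, y / m) / m

def max_kernel_gap(n, m, width, steps=200):
    """Largest |K_n - K_n^(m)| on a grid of separations in [0, width]."""
    return max(abs(kernel(n, u, 0.0) - kernel_rescaled(n, m, u, 0.0))
               for u in (width * k / steps for k in range(steps + 1)))

# The kernels agree on the diagonal and stay close at small separations.
print(max_kernel_gap(20, 4, 0.05), max_kernel_gap(20, 4, 0.5))
```

The gap between the two kernels shrinks with the separation $|x - y|$, which is consistent with the comparison being effective for sets of small diameter.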
In the microscopic regime, one can say more. As was first observed in [5], and more clearly spelled out in [13], the kernel $K^{(1)}_n = K_n$ has the following microscopic scaling limit:
$$\lim_{n \to \infty} \frac{1}{n} K_n\left(\frac{2\pi x}{n}, \frac{2\pi y}{n}\right) = \frac{\sin(\pi(x - y))}{\pi(x - y)}. \qquad (6)$$
The same microscopic scaling limit appears for bulk eigenvalues of certain Hermitian random matrices as well; see [1,4,13]. There is sufficient uniformity in the convergence in (6) to imply that the point process $\chi$, rescaled to lie in $[-n/2, n/2)$, converges as $n \to \infty$ to an unbounded point process on $\mathbb{R}$ which is determinantal, with the right hand side of (6) as its kernel with respect to Lebesgue measure. This process is called the sine kernel process; we denote by $S_A$ the number of points of the sine kernel process which lie in $A \subseteq \mathbb{R}$. In particular, by e.g. [1, Lemma 4.2.48], $N_{n, \frac{2\pi}{n} A} \Rightarrow S_A$. A limited version of Theorem 2 can be deduced from the convergence to the sine kernel process. On the other hand, Theorem 2 actually improves on the classical microscopic result by estimating a rate of convergence, as follows.
Proof. Let $n$ be large enough that $A \subseteq \left[-\frac{n}{2}, \frac{n}{2}\right)$ and $\operatorname{diam} A \le \frac{n}{2}$. Let $k \ge 0$. Recall that by definition of $\chi^{(2)}$,
$$N^{(2)}_{2^k n, \frac{2\pi}{2^k n} A} = N_{2^{k+1} n, \frac{2\pi}{2^{k+1} n} A}.$$
It thus follows from Theorem 2 (with $m = 2$ and $2^k n$ in place of $n$) that Fixing $M \in \mathbb{N}$ and applying this estimate for each $k \in \{0, \dots, M - 1\}$ then gives that As was discussed above, it is well known that $N_{2^M n, \frac{2\pi}{2^M n} A}$ converges to $S_A$, and so $W_1\left(N_{2^M n, \frac{2\pi}{2^M n} A}, S_A\right) \to 0$ as $M \to \infty$. Thus taking the limit $M \to \infty$ in (7) yields
Remark. The application of the Cauchy-Schwarz inequality in the last step of (4) in the proof of Proposition 4 above is the source of the factor of $\sqrt{n}$ in the statement of Theorem 2, which we conjecture to be unnecessary. A direct estimate of the quantity $\sum_j |\lambda_j - \tilde{\lambda}_j|$, which is bounded by the trace class norm of the difference $\mathcal{K} - \tilde{\mathcal{K}}$, could potentially avoid that dimensional factor, thereby increasing the size of the mesoscopic regime in which Theorem 2 gives non-trivial information. Unfortunately, trace class norms are considerably more difficult to compute than Hilbert-Schmidt norms, and we have not found an estimate which improves on the approach taken above.

Some further asymptotics
The following lemma gives asymptotics for $\operatorname{Var} N^{(m)}_\theta$ in various regimes. As was mentioned in the introduction, the paper [14] gives precise asymptotics as $n \to \infty$ for $\operatorname{Var} N_\theta$ when $\theta$ is fixed, but in the present context, estimates for when $\theta$ varies with $n$ are needed.
Moreover, Consequently, for a sequence θ n ∈ 0, π
One of the consequences of the lemma is that it allows us to identify the regime in which the (centered, normalized) counting function has a Gaussian limit, and to provide the estimates of the rate of convergence to Gaussian in that regime given in Proposition 8 below. The real point of the proposition is that the convergence of the centered, normalized counting functions of either point process to a Gaussian limit is much slower than the merging of distributions given in Theorem 2, meaning that the resemblance between $N_A$ and $N^{(m)}_A$ is emphatically not a consequence of the central limit theorem.
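The variances of the two counting functions can also be evaluated exactly for finite $n$, which gives a way to explore these regimes numerically. The sketch below is not the lemma itself: it assumes the sine-ratio CUE kernel with respect to $dx/2\pi$, uses the general determinantal identity $\operatorname{Var} N_A = \int_A K(x,x)\,d\mu - \iint_{A \times A} |K(x,y)|^2 \,d\mu\,d\mu$, and expands $|K_n|^2$ as the finite Fourier sum $\sum_{|k|<n} (n - |k|) e^{ik(x-y)}$. The function names are hypothetical.

```python
import math

def var_count(n, theta):
    """Var N_theta for the arc [-theta, theta] of an n x n CUE matrix,
    computed as E N_A minus the double integral of K_n^2 over A x A."""
    mean = n * theta / math.pi
    total = n * (theta / math.pi) ** 2               # k = 0 term
    for k in range(1, n):
        coeff = math.sin(k * theta) / (math.pi * k)  # k-th Fourier coefficient of 1_A
        total += 2 * (n - k) * coeff ** 2
    return mean - total

def var_count_rescaled(n, m, theta):
    """Var N^(m)_theta, using N^(m)_{n,theta} = N_{nm, theta/m}."""
    return var_count(n * m, theta / m)

# Gap between the two variances for a small arc and for a larger arc.
n, m = 50, 4
d_small = var_count_rescaled(n, m, 0.1) - var_count(n, 0.1)
d_large = var_count_rescaled(n, m, 1.0) - var_count(n, 1.0)
print(var_count(n, 0.1), d_small)
print(var_count(n, 1.0), d_large)
```

Two sanity checks are built into the formula: the full circle gives zero variance, and for $n = 1$ the count is a Bernoulli variable with parameter $\theta/\pi$. In this example the gap between the two variances is much smaller for the shorter arc.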
Proposition 8. For each $n$, let $\theta_n \in \left(0, \frac{\pi}{2}\right]$. The sequence $\{X_{n,\theta_n}\}$ converges weakly to the standard Gaussian distribution as $n \to \infty$ if and only if $n\theta_n \to \infty$. Moreover, whenever $\frac{3\pi}{n} \le \theta \le \pi$,
Remark. For $m \in \mathbb{N}$ fixed, a central limit theorem for $N^{(m)}_{n,\theta} = N_{nm, \theta/m}$ follows immediately from Proposition 8.
Proof. First observe that for any integer-valued random variable $X$ with finite second moment, it follows from Chebyshev's inequality that The cumulative distribution function of $X$ thus has a jump of at least $\frac{3}{16\sqrt{\operatorname{Var} X}}$ at some integer, and so where $Y$ is a Gaussian random variable with the same mean and variance as $N_\theta$. Since $N_\theta$ is integer-valued, together with Lemma 7 this proves both the lower bound in the proposition and the fact that $X_{n,\theta_n}$ can only have a Gaussian limit if $n\theta_n \to \infty$.
For the other estimate, the Berry-Esseen theorem (see, e.g., [6, Theorem XVI.5.1]) implies that if By Lemma 5, this may be applied to X n,θ , and so Lemma 7 implies the upper bound in the proposition.
As discussed in the introduction, we conjecture that Theorem 2 holds for any shrinking sequence of sets $A_n \subseteq [-\pi, \pi)$. We are not able to prove full distributional comparisons for the entire regime; however, the following result shows that equality of means and asymptotic equality of variances do hold throughout the entire mesoscopic regime. That is, if $\{A_n\}$ is any sequence of subsets of $[-\pi, \pi)$ such that $\operatorname{diam} A_n \le \pi$ eventually and $|A_n| \to 0$ (in particular, if $\operatorname{diam} A_n \to 0$), then as $n \to \infty$. For context, recall that it has already been shown that $\operatorname{Var} N_\theta$ itself, and thus Var N

Comparison of joint intensities
We conclude with the surprising fact that the joint intensities of the process χ (m) are always larger than those of the eigenvalue process χ; the implications of this observation remain mysterious (at least to us).
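For pairs of points this comparison is easy to check numerically. The sketch below treats only the second joint intensity and only integer $m$; it assumes the sine-ratio form of the two kernels with respect to $dx/2\pi$, so that $\rho_2 = n^2 - K(x, y)^2$ depends only on the separation $u = x - y$. The function names are hypothetical, and the check is an illustration rather than a proof.

```python
import math

def kernel_rescaled(n, m, u):
    """Assumed kernel of chi^(m) at separation u = x - y, for |u| < 2 pi:
    sin(n u / 2) / (m sin(u / (2 m))); m = 1 gives the CUE kernel itself."""
    s = m * math.sin(u / (2 * m))
    return float(n) if abs(s) < 1e-12 else math.sin(n * u / 2) / s

def rho_2(n, m, u):
    """Second joint intensity, the 2 x 2 determinant
    K(x,x) K(y,y) - K(x,y)^2 = n^2 - K(x,y)^2, at separation u."""
    return n ** 2 - kernel_rescaled(n, m, u) ** 2

# Compare the two intensities on a grid of separations in [0, 6.2].
n, m = 6, 3
grid = [6.2 * k / 500 for k in range(501)]
smallest_gap = min(rho_2(n, m, u) - rho_2(n, 1, u) for u in grid)
print(smallest_gap)
```

On this grid the intensity of the rescaled process is never smaller than that of the eigenvalue process, up to rounding error.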
Proposition 10. For each $m$, $n$, and $k$, let $\rho_k : [0, 2\pi)^k \to \mathbb{R}$ denote the $k$th joint intensity of the determinantal point process $\chi$, and let ρ
Proof. For this proof we use a different kernel which also generates the point process $\chi$ (see [5] or [