Low-degree learning and the metric entropy of polynomials
- CNRS, Sorbonne Université and Trinity College, Cambridge
- More about Alexandros Eskenazis
- Department of Mathematics, University of California, Irvine
- More about Paata Ivanisvili
- Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
- More about Lauritz Streck
Editorial introduction
Low-degree learning and the metric entropy of polynomials, Discrete Analysis 2023:17, 23 pp.
This paper is a follow-up to a paper by the first two authors on the general problem of designing algorithms that can efficiently learn functions defined on the Boolean cube {−1,1}n, given suitable conditions on the functions. Given a function f, the rough idea is to take a series of sample points, calculate f at those points, and guess a function g that is close to f. Obviously, without strong information about f, nothing non-trivial can be done. If, for example, the values of f are chosen uniformly and independently from [−1,1], then one learns nothing from the sample values other than the sample values themselves. However, many functions that one actually wants to learn are far from arbitrary, so one can try to identify suitable properties that make them amenable to learning from a far smaller number of samples.
One set-up considered by the authors is where f:{−1,1}n→[−1,1] is a low-degree polynomial and the samples are taken randomly. Since x2=1 when x∈{−1,1}, every polynomial function f of degree d defined on the Boolean cube has a formula of the form f(x)=∑|A|≤dλA∏i∈Axi, where A ranges over subsets of {1,2,…,n}. The functions x↦∏i∈Axi are the Walsh functions, and the λA are therefore the Fourier coefficients of f that correspond to sets A of size at most d – the size of A is often referred to as the degree of the Walsh function, and one often refers informally to “the Fourier coefficients of degree at most d” or “the degree-d part of the Fourier expansion”.
From simple dimension considerations, one sees easily that one cannot hope to determine a degree-d polynomial f exactly from fewer than ∑dr=0(nr) samples (even if it is required to take values in [−1,1]), so the aim is to determine it approximately. In this paper, and in the previous paper of the first two authors, the approximation considered is an L2 one. Thus, the aim is to find an algorithm that does the following: given a function f:{−1,1}→[−1,1] of degree at most d, it takes m random samples (x,f(x)), and outputs a function g such that ‖g−f‖2<ϵ with probability at least 1−δ. The important question is then how m depends on d,ϵ and δ.
An obvious quantity of interest is the size M(ϵ) of the largest set of degree-d functions f:{−1,1}→[−1,1] that are ϵ-separated. This, the metric entropy of the set of functions under consideration (more precisely, the metric entropy of the set is the dependence of this quantity on ϵ), is in some sense “the size of the set, up to ϵ-approximation”. Heuristically, one can think of each sample as providing a constant number of bits of information, and therefore on information-theoretic grounds one would expect to need at least clogM(ϵ) samples to obtain an ϵ-approximation to f (where c will have some dependence on δ).
Since the space of polynomials of degree at most d has dimension ∑dr=0(nr), which is of order of magnitude nd, one might naively expect logM(ϵ) to be of order of magnitude nd as well. And indeed, algorithms using roughly nd samples were the state of the art for the problem for quite a while. However, it is not obvious how the degree constraint interacts with the constraint that f should take values in [−1,1]: in principle, this could greatly reduce the “effective dimension” of the set. And indeed, this turns out to be the case. The earlier article, quite surprisingly, obtained an algorithm with m=C(ϵ,δ)logn. In broad terms, the authors noted that an inequality of Bohnenblust and Hille implies that the Fourier spectrum of f must be “analytically sparse”, and then they used this sparsity to argue that instead of estimating all the non-zero Fourier coefficients to high accuracy, one can afford to ignore the small coefficients and in this way achieve much better sample complexity.
One of the main purposes of the current article is to prove that the logarithmic bound in the earlier article is sharp, by obtaining a suitable lower bound for the metric entropy M(ϵ). They also show that an exponential dependence of the number of samples on the degree d, obtained in the earlier article, is necessary.
A second purpose is to obtain “robust” versions of the results of the earlier article, meaning that the class of functions to be approximated is enlarged from exact polynomials of degree at most d to approximate ones – that is, functions that are only approximately spanned by the Walsh functions of low degree. The do this by combining a junta theorem of Dinur-Friedgut-Kindler-O’Donnell (roughly, that approximately low-degree bounded functions approximately depend on just a few coordinates) with a general result they prove on the sample complexity of learning approximately low-degree bounded functions over the cube whose Fourier spectrum is concentrated around a small set of frequencies.
The paper concludes with a number of applications of these results.